Monday Sep 21, 2009

Haskell (GHC) on UltraSPARC T2

Ben Lippmeier gave an excellent presentation at the recent Haskell conference in Edinburgh on his work on porting the Glasgow Haskell Compiler (GHC) back to SPARC. A video of the talk is available.

Update:Link to slides

Monday Jul 28, 2008

Atomic operations

Solaris 10 provides atomic operations in libc. Atomic operations are sequences of instructions that behave as if they are a single atomic instruction that no other thread can interrupt. The way that the operations are implemented uses the compare-and-swap instruction (cas). A rough outline is as follows:

  Load existing value
  New value = existing value + increment
  return value = compare and swap(existing value and new value)
while (return value != existing value)

The compare-and-swap instruction atomically swaps the value in a register with the value held in memory, but only if the value held in memory equals the value held in another register. The return value is the value held in memory. The pseudo-code uses this so that the value stored back to memory will only be the incremented value if the current value in memory is equal to the value before the increment. If the value held in memory is not the one that was expected, the code retries the operation until it does succeed.

Breaking the compare-and-swap instruction into pseudo code looks like:

  CAS (Value held in memory, Old value, New value)
    Existing value = \*Value held in memory;
    if (Existing value == Old value)
       \*Value held in memory = New value;
       return Old value;
       return \*Value held in memory;

One of the questions from last week was how to use atomic operations in Solaris 9 if they are only available on Solaris 10. The answer is to write your own. The following code snippet demonstrates how to write an atomic increment operation:

.inline atomic_add,8
  ld [%o0],%o2         /\* Load existing value            \*/
  add %o2, %o1, %o3    /\* Generate new value             \*/
  cas [%o0],%o2,%o3    /\* Swap memory contents           \*/
  cmp %o2, %o3         /\* Compare old value with return  \*/
  bne 1b               /\* Fail if the old value does not \*/
                       /\* equal the value returned from  \*/
                       /\* memory                         \*/
  mov %o3,%o2          /\* Retry using latest value from  \*/
                       /\* memory                         \*/

It's probably a good idea to read this article on atomic operations on SPARC, and it may also be useful to read up on inline templates.

Wednesday May 07, 2008

When to use membars

membar instructions are SPARC assembly language instructions that enforce memory ordering. They tell the processor to ensure that memory operations are completed before it continues execution. However, the basic rule is that the instructions are usually only necessary in "unusual" circumstances - which fortunately will mean that most people don't encounter them.

The UltraSPARC Architecture manual documents the situation very well in section 9.5. It gives these rules which cover the default behaviour:

  • Each load instruction behaves as if it were followed by a MEMBAR #LoadLoad and #LoadStore.
  • Each store instruction behaves as if it were followed by a MEMBAR #StoreStore.
  • Each atomic load-store behaves as if it were followed by a MEMBAR #LoadLoad, #LoadStore, and #StoreStore.

There's a table in section 9.5.3 which covers when membars are necessary. Basically, membars are necessary for ordering of block loads and stores, and for ordering non-cacheable loads and stores. There is an interesting note where it indicates that a membar is necessary to order a store followed by a load to a different addresses; if the address is the same the load will get the correct data. This at first glance seems odd - why worry about whether the store is complete if the load is of independent data. However, I can imagine this being useful in situations where the same physical memory is mapped using different virtual address ranges - not something that happens often, but could happen in the kernel.

As a footnote, the equivalent x86 instruction is the mfence. There's a good discussion of memory ordering in section 7.2 of the Intel Systems Programming Guide.

There's some more discussion of this topic on Dave Dice's weblog.

Friday Mar 14, 2008

32-bits good, 64-bits better?

One of the questions people ask is when to develop 64-bit apps vs 32-bit apps. The answer is not totally clear cut, it depends on the application and the platform. So here's my take on it.

First, let's discuss SPARC. The SPARC V8 architecture defines the 32-bit SPARC ISA (Instruction Set Architecture). This was found on a selection of SPARC processors that appeared before the UltraSPARC I, ie quite a long time ago. Later the SPARC V9 architecture appeared. (As an aside, these are open standards, so anyone can download the specs and make one, of course not everyone has the time and resources to do that, but it's nice in theory ;)

The SPARC V9 architecture added a few instructions, but mainly added the capability to use 64-bit addressing. The ABI (Application Binary Interface) was also improved (floating point values passed in FP registers rather than the odd V8 choice of using the integer registers). The UltraSPARC I and onwards have implemented the V9 architecture, which means that they can execute both V8 and V9 binaries.

One of the things that it's easy to get taken in by is the animal farm idea that if 32-bits is good 64-bits must be better. The trouble with 64-bit address space is that it takes more instructions to set up an address, pointers go from 4 bytes to 8 bytes, and the memory footprint of the application increases.

A hybrid mode was also defined, which took the 32-bit address space, together with a selection of the new instructions. This was called v8plus, or more recently sparcvis. This has been the default architecture for SPARC binaries for quite a while now, and it combines the smaller memory footprint of SPARC V8 with the more recent instructions from SPARC V9. For applications that don't require 64-bit address space, v8plus or sparcvis is the way to go.

Moving to the x86 side things are slightly more complex. The high level view is similar. You have the x86 ISA, or IA32 as its been called. Then you have the 64-bit ISA, called AMD64 or EMT64. EMT64 gives you both 64-bit addressing, a new ABI, a number of new instructions, and perhaps most importantly a bundle of new registers. The x86 has impressively few registers, EMT64 fixes that quite nicely.

In the same way as SPARC, moving to a 64-bit address space does cost some performance due to the increased memory footprint. However, the x64 gains a number of additional registers, which usually more than make up for this loss in performance. So the general rule is that 64-bits is better, unless the application makes extensive use of pointers.

Unlike SPARC, EMT64 does not currently provide a mode which gives the instruction set extensions and registers with a 32-bit address space.

Friday Mar 07, 2008

Flush register windows

The SPARC architecture has an interesting feature called Register Windows. The idea is that the processor should contain multiple sets of registers on chip. When a new routine is called, the processor can give a fresh set of registers to the new routine, preserving the value of the old registers. When the new routine completes and control returns to the calling routine, the register values for the old routine are also restored. The idea is for the chip not to have to save and load the values held in registers whenever a routine is called; this reduces memory traffic and should improve performance.

The trouble with register windows, is that each chip can only hold a finite number of them. Once all the register windows are full, the processor has to spill a complete set of registers to memory. This is in contrast with the situation where the program is responsible for spilling and filling registers - the program only need spill a single register if that is all that the routine requires.

Most SPARC processors have about seven sets of register windows, so if the program remains in a call stack depth of about seven, there is no register spill/fill cost associated with calls of other routines. Beyond this stack depth, there is a cost for the spills and fills of the register windows.

The SPARC architecture book contains a more detailed description of register windows in section 5.2.2.

Most of the time software is completely unaware of this architectural decision, in fact user code should never have to be aware of it. There are two situations where software does need to know about register windows, these really only impact virtual machines or operating systems:

  • Context switches. In a context switch the processor changes to executing another software thread, so all the state from that thread needs to be saved for the thread to later resume execution. Note that setjmp and longjmp which are sometimes used as part of code to implement context switching already have the appropriate flushes in them.
  • Garbage collection. Garbage collection involves inspecting the state of the objects held in memory and determining whether each object is live or dead. Live objects are identified by having other live objects point to them. So all the registers need to be stored in memory so that they can be inspected to check whether they point to any objects that should be considered live.

The SPARC V9 instruction flushw will cause the processor to store all the register windows in a thread to memory. For SPARC V8, the same effect is attained through trap x03. Either way, the cost can be quite high since the processor needs to store up to about 7 sets of register windows to memory Each set is 16 8-byte registers, which results in potentially a lot of memory traffic and cycles.

Tuesday Oct 16, 2007

Building shared libraries for SPARCV9

By default, the SPARC compiler assumes that SPARCV9 objects are built with -xcode=abs44, which means that 44 bits are used to hold the absolute address of any object. Shared libraries should be built using position independent code, either -xcode=pic13 or -xcode=pic32 (replacing the deprecated -Kpic and -KPIC options.

If one of the object files in a library is built with abs44, then the linker will report the following error:

ld: fatal: relocation error: R_SPARC_H44: file file.o: symbol <unknown>:
relocations based on the ABS44 coding model can not be used in building
a shared object
Further details on this can be found in the compiler documentation.

Tuesday Oct 09, 2007

CMT Developer Tools on the UltraSPARC T2 systems

The CMT Developer Tools are included on the new UltraSPARC T2 based systems, together with Sun Studio 12, and GCC for SPARC Systems.

The CMT Developer Tools are installed in:

and are (unfortunately) not on the default path.

Tags and the UltraSPARC T2

Click here to see what others have tagged with 't2' (or ultraSPARC T2) (or CMT).

Here's Allan Packer's post summarising some of the blog posts about the UltraSPARC T2.

Tuesday Sep 25, 2007

Solaris Application Programming book

I'm thrilled to see that my book is being listed for pre-order on Amazon in the US. It seems to take about a month for it to travel the Atlantic to Amazon UK.

Tuesday Aug 07, 2007

UltraSPARC T2 documentation available

The documentation for the UltraSPARC T2 is available from the OpenSPARC website.


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge


« April 2014
The Developer's Edge
Solaris Application Programming
OpenSPARC Book
Multicore Application Programming