How to learn SPARC assembly language

Got a question this morning about how to learn SPARC assembly language. It's a topic that I cover briefly in my book, but the coverage there was never meant to be complete. The text in my book is meant as a quick guide to reading SPARC (and x86) assembly, so that the later examples make some kind of sense. The basics start with the instruction format:

[instruction]  [source register 1], [source register 2], [destination register]

For example:

faddd    %f0, %f2, %f4

computes:

%f4 = %f0 + %f2

The other thing to learn that's different about SPARC is the branch delay slot: the instruction placed after the branch is actually executed as part of the branch. This is different from x86, where a branch instruction is the delimiter of the block of code.
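To make the delay slot concrete, here is a toy model in Python. It is only a sketch, not a SPARC simulator: the instruction names are made up, only an unconditional branch is modelled, and annulled delay slots are ignored. What it does borrow from SPARC is the pair of program counters (PC and next-PC), which is why the instruction after a branch still runs.

```python
# Toy model of SPARC-style delayed branches (not a real SPARC simulator).
# Each instruction is (name, branch_target); branch_target is None for
# ordinary instructions. All names here are made up for illustration.

def run(program):
    """Return the names of the instructions in the order they execute."""
    executed = []
    pc, npc = 0, 1                 # SPARC keeps both a PC and a next-PC
    while pc < len(program):
        name, target = program[pc]
        executed.append(name)
        # A taken branch only changes the instruction *after* the next one,
        # so the instruction already sitting at npc (the delay slot) still runs.
        next_npc = target if target is not None else npc + 1
        pc, npc = npc, next_npc
    return executed


program = [
    ("branch_to_3", 3),     # 0: unconditional branch to index 3
    ("delay_slot", None),   # 1: executes even though the branch is taken
    ("skipped", None),      # 2: never executes
    ("target", None),       # 3: branch destination
]

order = run(program)
print(order)
```

The instruction at index 1 runs before control reaches index 3, and index 2 is never reached.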

With those basics out of the way, the next thing to do would be to take a look at the SPARC Architecture manual, which is a very detailed reference to all the software-visible implementation details.

Finally, I'd suggest just writing some simple codes, and profiling them using the Sun Studio Performance Analyzer. Use the disassembly view tab and the architecture manual to see how the instructions are used in practice.


Two things:

When I was learning SPARC assembler, I absolutely hated the following (sorry about the lost formatting):

String0: .asciz "Counter is %.2d\n"
.align 4

sethi %hi(String0), %l0      ! upper 22 bits of the address
or %l0, %lo(String0), %l0    ! lower 10 bits of the address

I understand that there isn't enough space to load a 32-bit value directly, but is there a way to do the above with one or more ld (but not synthetic!) instructions? Try as I might, I couldn't figure it out; all my attempts ended up dumping core. (And yes, I understand that sethi-or combo is fast, faster than ld, but that's not the point.)

That sethi-or combination is just horrible, absolutely horrible. To give you some context, I come from an MC680xx background, where one could just say

lea.l $dff000, a0

or even

move.l #$dff000, a0

or even

lea.l $dff000(pc), a0

The branch delay slot is a wonderful thing, basically giving one an instruction cycle for free, with some clever/creative use.

Posted by UX-admin on November 13, 2008 at 03:43 PM PST #

I agree that it's ugly. I'd quite like to see an instruction to handle a 32-bit immediate, but that would mean data in the instruction stream.

I don't think there's an easy way of using a load to get the start address. I suspect you might be able to use a jmp to find out the pc, and then offset from the pc. But that would require data in the instruction space again.

I'm not very fond of the branch delay slot. Although it's kind of entertaining, it can be awkward to locate instructions to fill it (for some cases).


Posted by Darryl Gove on November 14, 2008 at 01:02 PM PST #

I have done x86 assembly on Linux, and I am currently beginning SPARC assembly. Yeah, I agree, the sethi + or combination of instructions is not very nice, but there are downsides to everything. However, I see many good sides to SPARC assembly:

1) Operations often take three operands, which lets you specify the destination register

2) Many registers available, many more than on the Intel architecture

3) I like the concept of register windows; it gives you the ability to free registers for use inside a function

4) Passing arguments to a function through registers instead of pushing them through the stack (yeah, you do use the stack when you have more arguments than registers available for that purpose, but you still have an advantage). This leads to a performance increase, and also simplifies the code (pushing arguments backwards on the stack and popping them out is not always nice)

5) Instructions are of a fixed size, which makes disassembly much easier, and moving to the previous or next instruction is easy because they all have the same size.

I have ordered "SPARC Architecture, Assembly Language Programming, and C (2nd Edition)".

Reviews are not that good but I think half of them refer to the first edition, so I've ordered a copy anyway. I still haven't received it but you might want to take a look at it.

Posted by Marc-Andre Moreau on November 16, 2008 at 07:37 AM PST #


1. Yes, the format of the instructions is pretty clear, and most disassembly is relatively easy to interpret.

2. On x86 this is addressed by the EM64T extensions.

3. Register windows are an interesting feature. As you say, they are very helpful; the issue is that the chip has a finite number of windows. When you run out there's a software trap that spills or fills them to/from memory. Unfortunately the spill/fill has to be of all the registers, not just the ones that need to be preserved. For codes which have very low stack depth, register windows work well. For codes where the application is going up and down a large stack depth, there can be a significant cost. Check trapstat to see if this is happening to your code.

4. Again, for x86 this has been addressed through EM64T. Something that's interesting is that the SPARC v9 architecture passes FP parameters in the FP registers, but the v8 architecture (for some reason which escapes me) passes FP values in the int registers.

5. Yes, this is a really useful feature. You can disassemble an application from any arbitrary point in the code. On x86 you could be starting from the middle of an instruction.
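The fixed instruction size can be illustrated with a short Python sketch. It assumes the standard SPARC format-3 layout used by arithmetic instructions (op, rd, op3, rs1, the i bit, then either rs2 or a signed 13-bit immediate) and is not a disassembler: it only splits the fields out. The helper names and the two example words, which should correspond to add %g1, %g2, %g3 and add %g1, 4, %g1, are chosen for illustration.

```python
import struct

def decode_format3(word):
    """Split a 32-bit SPARC format-3 instruction word into its fields."""
    fields = {
        "op":  (word >> 30) & 0x3,
        "rd":  (word >> 25) & 0x1F,
        "op3": (word >> 19) & 0x3F,
        "rs1": (word >> 14) & 0x1F,
        "i":   (word >> 13) & 0x1,
    }
    if fields["i"]:
        simm13 = word & 0x1FFF
        if simm13 & 0x1000:            # sign-extend the 13-bit immediate
            simm13 -= 0x2000
        fields["simm13"] = simm13
    else:
        fields["rs2"] = word & 0x1F
    return fields

def words(code, offset=0):
    """Yield big-endian 32-bit words of `code` from a 4-byte-aligned offset."""
    for pos in range(offset, len(code) - 3, 4):
        yield struct.unpack_from(">I", code, pos)[0]

# Two instructions back to back (example encodings, see lead-in).
code = bytes.fromhex("8600400282006004")

# Start in the "middle" of the code: any multiple of 4 is an instruction boundary.
second = [decode_format3(w) for w in words(code, offset=4)]
print(second)
```

Because every instruction is exactly four bytes, starting at offset 4 lands cleanly on the second instruction. With variable-length x86 encodings there is no such guarantee.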

I seem to have lost the copy I had of the Paul book. I last read it many years ago. IIRC it gave an ok summary of SPARC assembly, but had two issues: I don't think it covered SPARC v9, and it made heavy use of preprocessed macros.



Posted by Darryl Gove on November 17, 2008 at 06:12 AM PST #

About register windows and spill/fill, is there a lot of space on the chip, or do you have to spill/fill very often? I foresee a significant performance decrease when using a recursive method that calls itself a lot of times. Would recursion be more efficient with a stack or with register windows?

Posted by Marc-Andre Moreau on November 18, 2008 at 10:48 AM PST #

For >~7 levels of recursion, you end up doing spill and fill traps, which eat away at performance. So probably stack based recursion would be better - so long as the number of active registers is fewer than the total number of registers.
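As a rough illustration of where the traps come from, here is a small Python model. It is a sketch under stated assumptions: the figure of 7 usable windows is taken from the "~7 levels" above and varies between chips, each trap is assumed to move exactly one window, and count_window_traps is a made-up name, not a real tool.

```python
def count_window_traps(depth, usable_windows=7):
    """Count spill/fill traps for `depth` nested calls followed by `depth` returns."""
    resident = 1          # windows of the current call chain held on chip
    frames = 1            # live frames, including the caller we start in
    spills = fills = 0
    for _ in range(depth):             # the calls (save)
        frames += 1
        if resident == usable_windows:
            spills += 1                # oldest window is written to the stack
        else:
            resident += 1
    for _ in range(depth):             # the returns (restore)
        frames -= 1
        resident -= 1
        if resident == 0 and frames > 0:
            fills += 1                 # caller's window is read back from the stack
            resident = 1
    return spills, fills


for depth in (6, 7, 100):
    print(depth, count_window_traps(depth))
```

In this model shallow call chains never trap, while a deep recursion pays one spill on the way down and one fill on the way back up for every level beyond what fits on chip.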


Posted by Darryl Gove on November 18, 2008 at 01:20 PM PST #

@UX-admin. You might want to read about the pseudo-ops set/setx on pg 703 of the UltraSPARC Architecture book. These should expand to the required instructions.

Posted by Darryl Gove on November 19, 2008 at 02:29 PM PST #

"4) Passing arguments to a function through registers instead of pushing them through the stack (yeah, you do use the stack when you have more arguments than registers available for that purpose, but you still have an advantage)."

Traditionally, there have been two approaches to this:

a) push the needed data on the stack
b) pass a pointer to a struct, the pointer being the address to either an allocated, or a fixed chunk of memory.

In my experience, since the stack is in reality also just normal memory on most systems, pushing values serially actually takes more CPU cycles than accessing the struct with all the data in it (and popping data back from the stack costs additional CPU cycles). Of course, with the struct technique, the price to pay is memory latency.

The trick is to preload as much of the struct into as many registers as will fit; always try to keep the data in registers for as long as possible, doing as much work on the data as possible before "flushing" back to memory.

But as always, this requires creative thinking. Assembler never ceases to fascinate me, in the way that it actually leaves lots and lots of room for creative/clever programming.

Posted by UX-admin on November 20, 2008 at 04:59 PM PST #

"You might want to read about the pseudo-ops set/setx on pg 703"

Hey, thank you for that pointer (pg. 703), I really appreciate it!

Posted by UX-admin on November 20, 2008 at 05:01 PM PST #

I see now. Even with setx (a synthetic instruction), the result is still a variation of a sethi-or combo.

So it looks to me like the only way to load 32- and 64-bit values from memory is to use the sethi-or.

What a strange CPU architecture.

But if that's the way to do it, then I guess I'll have to live with that.

Posted by UX-admin on November 20, 2008 at 05:07 PM PST #

Yup, with 4-byte instructions, it takes several to actually load an address. The set pseudo instructions make this easier from the perspective of the user, but still become a bundle of instructions when the code is generated.
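The arithmetic behind the sethi/or pair can be sketched in Python. This assumes the usual split for 32-bit values, where %hi() takes the upper 22 bits and %lo() the lower 10. The address used is hypothetical, and the function names are only for illustration.

```python
def hi(value):
    """Upper 22 bits of a 32-bit value, as used by %hi()."""
    return (value & 0xFFFFFFFF) >> 10

def lo(value):
    """Lower 10 bits of a 32-bit value, as used by %lo()."""
    return value & 0x3FF

def sethi_or(value):
    """Rebuild `value` the way the sethi/or pair does."""
    reg = hi(value) << 10      # sethi: place 22 bits at the top, clear the low 10
    reg = reg | lo(value)      # or: fill in the low 10 bits
    return reg


address = 0x00021A64           # hypothetical address of String0
print(hex(hi(address)), hex(lo(address)), hex(sethi_or(address)))
```

Neither half fits a whole 32-bit value because each instruction is itself only 32 bits wide, so the immediate has to share the word with the opcode and register fields.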


Posted by Darryl Gove on November 21, 2008 at 04:29 AM PST #

I want to know about the exact use of an assembler in a system.
And why does it depend on the operating system?

Posted by Prajeesh on November 25, 2008 at 09:17 PM PST #

I want to know the difference between the Itanium processor and x86.

Posted by prajeesh on November 25, 2008 at 09:27 PM PST #

who r u?

Posted by menaka on November 25, 2008 at 09:30 PM PST #

@Prajeesh. As you point out, assembly language is processor specific, and the means of writing it can be OS or tools specific. So you need to start by identifying the platform.

The Itanium and x86 lines are completely different. So the difference is "everything". The best starting points are either Wikipedia or the docs on the Intel site.



Posted by Darryl Gove on November 26, 2008 at 05:39 AM PST #


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge