
GCC inline assembly, part 2

Guest Author

Long ago, I promised to write more about gcc inline assembly, in
particular a few cases that are tricky to get right. Here, somewhat
belatedly, are those cases. These examples are taken from libc, but the
concepts apply to any inline assembly fragments you write for gcc. As I
mentioned previously
(http://blogs.sun.com/roller/page/wesolows?entry=the_first_opensolaris_project_gcc1),
these concerns apply only to gcc-style inlines; the Studio-style inline
format doesn't require that you use this same level of caution. gcc
expects you to write assembly fragments (even in a "separate" inline
function) as if they are logically a part of the caller. That is, the
compiler will allocate registers or other appropriate storage locations
to each of the input and output C variables. This requires that you
instruct the compiler very carefully as to your use of each variable,
and the variables' relationships to one another. The advantage is much
better register allocation; the compiler is free to allocate whatever
registers it wishes to your input and output variables in a manner that
is transparent to you. Studio, by contrast, requires that you code the
fragment as if it were a leaf function, so the compiler does not do any
register allocation for you. You are permitted to use the caller-saved
registers any way you wish, and even to use the caller's stack as if you
are in a leaf function. Arguments and return values are stored in their
ABI-defined locations. Depending on the optimization level you use,
this can be wasteful of registers (though the peephole optimizer can
often clean up some of this waste) and can also make writing the
fragment much more difficult. In exchange, however, you don't have to
be nearly as careful to express the fragment's operation to the
compiler.

Inputs, Outputs, and Clobbers (oh my!)

Each assembly fragment may have any or all of outputs, inputs, and
clobbers. Each input and output maps a C variable or literal to a
string suitable for use as an assembly operand. These operands can then
be referenced as %0, %1, %2, etc.
These are ordered beginning from 0 with the first output, followed by
the inputs. Alternatively, newer versions of gcc allow the use of
symbolic names for each input and output. Clobbers are somewhat
different; they express the set of registers and/or memory whose values
are changed by the fragment but are not expressed in the outputs.
Inputs which are also changed must be listed as outputs, not clobbers.
Normally, the clobbers include explicit registers used by certain
instructions, but may also include "cc" to indicate that
the condition code registers are modified and/or "memory"
to indicate that arbitrary memory addresses have had their contents
altered.
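
To make the operand numbering concrete, here is a minimal sketch that is not taken from libc. The function names add_positional and add_named are hypothetical. The asm is compiled only on SPARC; on any other target the functions fall back to plain C so that the file still builds.

```c
#include <assert.h>

/* Hypothetical helper: operands are referenced by position. */
static inline int
add_positional(int a, int b)
{
    int sum;
#if defined(__sparc__)
    __asm__("add %1, %2, %0"
        : "=r" (sum)            /* %0: the only output */
        : "r" (a), "r" (b));    /* %1 and %2: the inputs, in order */
#else
    sum = a + b;                /* portable stand-in on other CPUs */
#endif
    return (sum);
}

/* The same fragment written with symbolic operand names. */
static inline int
add_named(int a, int b)
{
    int sum;
#if defined(__sparc__)
    __asm__("add %[lhs], %[rhs], %[result]"
        : [result] "=r" (sum)
        : [lhs] "r" (a), [rhs] "r" (b));
#else
    sum = a + b;                /* portable stand-in on other CPUs */
#endif
    return (sum);
}
```

Neither fragment lists any clobbers, because the SPARC add instruction changes nothing except its destination register. Had the fragment used addcc instead, "cc" would belong in the clobber list.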

Constraints

Outputs and inputs are expressed as constraints, in a language
specifying the type of operand that will contain the value of a
variable. Common constraints include "r", indicating that
a general register should be allocated, and "m" indicating
that some type of memory location should be used. The complete list of
constraints is found in the gcc documentation
(http://gcc.gnu.org/onlinedocs/gcc-3.4.5/gcc/Constraints.html#Constraints).
These constraints may contain modifiers, which
give gcc more information about how the operand will be used. The most
common modifiers are "=", "+", and
"&". The "=" modifier is used to indicate
that the operand is output-only; it may appear only in the constraint
for an output variable. Even if the constraint is applied to a variable
containing an existing value in your program, there is no guarantee that
it will contain that value when your assembly fragment is executed. If
you need that, you must use the "+" modifier instead of
"="; this tells the compiler that this operand is both an
input and an output. Nevertheless, the variable with this constraint is
provided only in the outputs section of the fragment's specification.
An alternate way to express the same thing is provided in the
documentation. Note that providing the same variable as both an input
and an output does not guarantee you that the same location (register,
address, etc.) will be used for both of them. Thus the following is
generally incorrect:

static inline int
add(int var1, int var2)
{
    __asm__(
        "add %2, %0"
        : "=r" (var1)
        : "r" (var1), "r" (var2));

    return (var1);
}
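
One way the fragment above could be repaired is sketched here. This is an illustration under assumptions, not code from libc: the names add_rw and add_tied are hypothetical, the x86 branch is included only so the sketch can be exercised on a common host, and targets other than SPARC and x86 fall back to plain C. The first function uses the "+" modifier; the second uses a matching constraint ("0"), which is the alternate form described in the gcc documentation.

```c
#include <assert.h>

/* Hypothetical fix 1: var1 is one read-write operand. */
static inline int
add_rw(int var1, int var2)
{
#if defined(__sparc__)
    __asm__("add %0, %1, %0"
        : "+r" (var1)           /* %0 is read, then written */
        : "r" (var2));
#elif defined(__x86_64__) || defined(__i386__)
    __asm__("addl %1, %0"
        : "+r" (var1)
        : "r" (var2)
        : "cc");                /* addl updates the flags */
#else
    var1 += var2;               /* portable stand-in */
#endif
    return (var1);
}

/* Hypothetical fix 2: input "0" must share operand 0's location. */
static inline int
add_tied(int var1, int var2)
{
    int sum;
#if defined(__sparc__)
    __asm__("add %0, %2, %0"
        : "=r" (sum)
        : "0" (var1), "r" (var2));
#elif defined(__x86_64__) || defined(__i386__)
    __asm__("addl %2, %0"
        : "=r" (sum)
        : "0" (var1), "r" (var2)
        : "cc");
#else
    sum = var1 + var2;          /* portable stand-in */
#endif
    return (sum);
}
```

In both forms the compiler is told explicitly that the register holding the first addend is the one that receives the result, which is exactly the guarantee the incorrect version lacked.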

The "&" modifier is used on an output operand whose value
is overwritten before all the input operands are consumed. This
exists to prevent gcc from using the same register for both the input
and output operands. For example, for swap32()
(http://cvs.opensolaris.org/source/xref/on/usr/src/lib/libc/inc/thr_inlines.h;
see also the Studio inline function at
http://cvs.opensolaris.org/source/xref/on/usr/src/lib/libc/sparc/threads/sparc.il#68),
we might think to write:

extern __inline__ uint32_t
swap32(volatile uint32_t *__memory, uint32_t __value)
{
    ...
    uint32_t __tmp1, __tmp2;

    __asm__ __volatile__(
        "ld [%3], %1\n\t"
        "1:\n\t"
        "mov %0, %2\n\t"
        "cas [%3], %1, %2\n\t"
        "cmp %1, %2\n\t"
        "bne,a,pn %%icc, 1b\n\t"
        " mov %2, %1"
        : "+r" (__value), "=r" (__tmp1), "=r" (__tmp2)
        : "r" (__memory)
        : "cc");

    return (__tmp2);
}

But suppose gcc decided to allocate o0 to both
__tmp1 and __memory. This is allowable,
because the "=r" constraint implies that the corresponding
register is set only after all input-only operands are no longer needed
(input/output operands obviously don't have this problem). In the case
above, the first load would clobber o0 and the
cas would operate on an arbitrary location. Instead, we
must write "=&r" for both __tmp1 and
__tmp2; neither variable may safely be allocated the same
register as the input operand.
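
The same hazard can be reproduced in a much smaller fragment. The sketch below is an illustration, not libc code: add_early is a hypothetical name, the x86 branch exists only so the example can be tried on a common host, and other targets fall back to plain C. The output is written by the first instruction while the second input is still waiting to be read, so the output needs "&".

```c
#include <assert.h>

/* Hypothetical two-instruction add that needs an earlyclobber. */
static inline int
add_early(int a, int b)
{
    int sum;
#if defined(__sparc__)
    __asm__("mov %1, %0\n\t"    /* sum is written here ... */
        "add %0, %2, %0"        /* ... but b is only read here */
        : "=&r" (sum)           /* so sum must not share b's register */
        : "r" (a), "r" (b));
#elif defined(__x86_64__) || defined(__i386__)
    __asm__("movl %1, %0\n\t"
        "addl %2, %0"
        : "=&r" (sum)
        : "r" (a), "r" (b)
        : "cc");                /* addl updates the flags */
#else
    sum = a + b;                /* portable stand-in */
#endif
    return (sum);
}
```

With a plain "=r" the compiler would be entitled to place sum and b in the same register, in which case the first instruction would destroy b before the second one reads it.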

Bugs caused by omitting the earlyclobber are painful to track down
because they often appear and disappear from one compilation to the
next as entirely unrelated code changes cause increases or decreases
in register pressure.

This is not an academic concern. Consider this example program:

#include <stdint.h>

static __inline__ void
incr32(volatile uint32_t *__memory)
{
    uint32_t __tmp1, __tmp2;

    __asm__ __volatile__(
        "ld [%2], %0\n\t"
        "1:\n\t"
        "add %0, 1, %1\n\t"
        "cas [%2], %0, %1\n\t"
        "cmp %0, %1\n\t"
        "bne,a,pn %%icc, 1b\n\t"
        " mov %1, %0"
        : "=r" (__tmp1), "=r" (__tmp2)
        : "r" (__memory)
        : "cc");
}

uint32_t
func(uint32_t x)
{
    uint32_t y = 4;
    uint32_t z = x + y;

    incr32(&y);
    z = x + y;
    return (z);
}

gcc compiles this (use -O2 -mcpu=v9 -mv8plus) into:
func()
func: 9c 03 bf 88 add %sp, -0x78, %sp
func+0x4: 9a 10 20 04 mov 0x4, %o5
func+0x8: 90 02 20 04 add %o0, 0x4, %o0
func+0xc: da 23 a0 64 st %o5, [%sp + 0x64]
func+0x10: 82 03 a0 64 add %sp, 0x64, %g1
func+0x14: c2 00 40 00 ld [%g1], %g1          <===
func+0x18: 9a 00 60 01 add %g1, 0x1, %o5
func+0x1c: db e0 50 01 cas [%g1] , %g1, %o5   <= SEGV
func+0x20: 80 a0 40 0d cmp %g1, %o5
func+0x24: 32 47 ff fd bne,a,pn %icc, func+0x18
func+0x28: 82 10 00 0d mov %o5, %g1
func+0x2c: 81 c3 e0 08 retl
func+0x30: 9c 23 bf 88 sub %sp, -0x78, %sp

In this case, gcc has allocated g1 to both
__tmp1 and __memory, and o5 to
__tmp2. Note the marked instructions: the initial
load destroys the address held in g1, and the subsequent
cas will attempt to operate on whatever value was stored
at *__memory when the fragment began, treating that value as an
address. In this example,
that value will be 4 (g1 is assigned sp+0x64,
which is simply the address of y). This program is
compiled incorrectly due to improper constraints, and will cause a
segmentation fault if the code in question is executed.

If instead we use "=&r" for both __tmp1
and __tmp2, gcc generates the following code:

func()
func: 9c 03 bf 88 add %sp, -0x78, %sp
func+0x4: 9a 10 20 04 mov 0x4, %o5
func+0x8: 90 02 20 04 add %o0, 0x4, %o0
func+0xc: da 23 a0 64 st %o5, [%sp + 0x64]
func+0x10: 82 03 a0 64 add %sp, 0x64, %g1
func+0x14: d8 00 40 00 ld [%g1], %o4          <===
func+0x18: 9a 03 20 01 add %o4, 0x1, %o5
func+0x1c: db e0 50 0c cas [%g1] , %o4, %o5   <= OK
func+0x20: 80 a3 00 0d cmp %o4, %o5
func+0x24: 32 47 ff fd bne,a,pn %icc, func+0x18
func+0x28: 98 10 00 0d mov %o5, %o4
func+0x2c: 81 c3 e0 08 retl
func+0x30: 9c 23 bf 88 sub %sp, -0x78, %sp

This code now assigns o4 to __tmp1, which
eliminates the problem described above. This function, however, still
does not do the right thing. Why not?

Reloading

Compilers keep track of where each live variable in the program can
be found; many variables can be found both at some memory location and
in a register. Sometimes, the compiler chooses to use a register for a
different variable, and stores the value back to its memory location (if
it has changed) before doing so. Later, if this value is needed, the
value must be loaded back into a register before being used. This is
known as reloading. Other reasons reloading may be required include a
variable's declaration as volatile and the case that
concerns us here, a variable's modification via side effects.

In the example above, incr32() is actually operating on
a memory address, not a register. So why did we assign
__memory the "r" constraint instead of more
correctly expressing the constraint as "+m" (*__memory)?
It turns out that the "m" constraint allows a variety of
possible addressing modes. On SPARC, this includes the register/offset
mode (such as [%sp+0x64]). This is fine for instructions
like ld and st, but the cas
instruction is special: it allows no offset. No constraint exists to
describe this condition. The "V" constraint looks
similar but is not correct: a bare register ([%g1]) is an
offsettable address, so "V" would actually exclude the case
we want. Conversely, "o", the inverse constraint of
"V", includes the register/offset addressing mode we
specifically wish to exclude. So the only way to express this
constraint is "r". But this does nothing to capture the
fact that although the pointer itself is not modified, the value at
*__memory is altered by the assembly fragment. Is this a
problem? Let's look at the assembly generated for func() a
little more closely:

func()
func: 9c 03 bf 88 add %sp, -0x78, %sp
func+0x4: 9a 10 20 04 mov 0x4, %o5
func+0x8: 90 02 20 04 add %o0, 0x4, %o0       <===
func+0xc: da 23 a0 64 st %o5, [%sp + 0x64]
func+0x10: 82 03 a0 64 add %sp, 0x64, %g1
func+0x14: d8 00 40 00 ld [%g1], %o4
func+0x18: 9a 03 20 01 add %o4, 0x1, %o5
func+0x1c: db e0 50 0c cas [%g1] , %o4, %o5
func+0x20: 80 a3 00 0d cmp %o4, %o5
func+0x24: 32 47 ff fd bne,a,pn %icc, func+0x18
func+0x28: 98 10 00 0d mov %o5, %o4
func+0x2c: 81 c3 e0 08 retl                   <===
func+0x30: 9c 23 bf 88 sub %sp, -0x78, %sp

We see that gcc has assigned z the o0
register, which is not surprising given that it's the return value. But
after o0 is set to x + 4 at the beginning of
the function, it's never set again. The line z = x + y has
been discarded by the compiler! This is because it does not know that
our inline assembly modified the value of y, so it did not
reload the value and recalculate z.

There are two ways we can correct this problem: (a) add a
"+m" output operand for *__memory, or (b) add
"memory" to the list of clobbers. The latter is a special
clobber that tells gcc not to trust the values in any registers it would
otherwise believe to hold the current values of variables stored in
memory. In short, this clobber tells gcc that any value it has cached in
a register must be reloaded from memory before it is used again. This is
somewhat inefficient when we know which piece of memory has been
touched, so (a) is preferable for better performance.

Whichever solution we choose, gcc now compiles our code to:

func()
func: 9c 03 bf 88 add %sp, -0x78, %sp
func+0x4: 9a 10 20 04 mov 0x4, %o5
func+0x8: 98 10 00 08 mov %o0, %o4
func+0xc: da 23 a0 64 st %o5, [%sp + 0x64]
func+0x10: 82 03 a0 64 add %sp, 0x64, %g1
func+0x14: d6 00 40 00 ld [%g1], %o3
func+0x18: 9a 02 e0 01 add %o3, 0x1, %o5
func+0x1c: db e0 50 0b cas [%g1] , %o3, %o5
func+0x20: 80 a2 c0 0d cmp %o3, %o5
func+0x24: 32 47 ff fd bne,a,pn %icc, func+0x18
func+0x28: 96 10 00 0d mov %o5, %o3
func+0x2c: d0 03 a0 64 ld [%sp + 0x64], %o0   <===
func+0x30: 90 03 00 08 add %o4, %o0, %o0      <===
func+0x34: 81 c3 e0 08 retl
func+0x38: 9c 23 bf 88 sub %sp, -0x78, %sp

Note the reload, which will now return the correct result. There
are actually two other ways to correct this, although the use of
"+m" is the most correct. First, we could declare
y to be volatile in func(). This
would force gcc to reload its value from memory any time that value is
required. Use of the volatile keyword is mainly useful
when some external thread (or hardware) may change the value at any
time; using it as a substitute for correct constraints will cause
unnecessary reloading, degrading performance. Second, and perhaps best
of all, the compiler could be modified to accept a SPARC-specific
constraint for use with the cas instruction, one which
requires the address of the operand to be stored in a general register.
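
For reference, the two constraint-based fixes discussed above, (a) and (b), might be written as in the following sketch. It rests on several assumptions: the names incr32_m, incr32_clobber and func_sketch are hypothetical, the SPARC branch needs the same -mcpu=v9 -mv8plus flags as before because it uses cas, and on any other target the functions fall back to a gcc atomic builtin purely as a stand-in so that the example builds and runs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical variant (a): "+m" names the exact memory that changes. */
static inline void
incr32_m(volatile uint32_t *memory)
{
#if defined(__sparc__)
    uint32_t tmp1, tmp2;

    __asm__ __volatile__(
        "ld [%3], %0\n\t"
        "1:\n\t"
        "add %0, 1, %1\n\t"
        "cas [%3], %0, %1\n\t"
        "cmp %0, %1\n\t"
        "bne,a,pn %%icc, 1b\n\t"
        " mov %1, %0"
        : "=&r" (tmp1), "=&r" (tmp2), "+m" (*memory) /* %2 is unused */
        : "r" (memory)          /* %3: cas needs a bare register address */
        : "cc");
#else
    /* Stand-in for other CPUs; not the fragment under discussion. */
    (void) __atomic_fetch_add(memory, 1, __ATOMIC_SEQ_CST);
#endif
}

/* Hypothetical variant (b): the blanket "memory" clobber. */
static inline void
incr32_clobber(volatile uint32_t *memory)
{
#if defined(__sparc__)
    uint32_t tmp1, tmp2;

    __asm__ __volatile__(
        "ld [%2], %0\n\t"
        "1:\n\t"
        "add %0, 1, %1\n\t"
        "cas [%2], %0, %1\n\t"
        "cmp %0, %1\n\t"
        "bne,a,pn %%icc, 1b\n\t"
        " mov %1, %0"
        : "=&r" (tmp1), "=&r" (tmp2)
        : "r" (memory)
        : "cc", "memory");
#else
    (void) __atomic_fetch_add(memory, 1, __ATOMIC_SEQ_CST);
#endif
}

/* Mirrors func() from the text, calling each variant once. */
static uint32_t
func_sketch(uint32_t x)
{
    uint32_t y = 4;

    incr32_m(&y);
    incr32_clobber(&y);
    return (x + y);             /* y must be reloaded after each call */
}
```

Note that in variant (a) the "+m" operand never appears in the template. Its only job is to tell the compiler which object the fragment reads and writes, while the "r" operand supplies the address in the form cas requires.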

You can find more inline assembly examples illustrating these concepts
in libc (math functions:
http://cvs.opensolaris.org/source/xref/on/usr/src/lib/libc/inc/base_inlines.h),
MD5 acceleration
(http://cvs.opensolaris.org/source/xref/on/usr/src/common/crypto/md5/md5_byteswap.h),
and the kernel
(http://cvs.opensolaris.org/source/xref/on/usr/src/uts/sparc/asm/). Be
sure to read and understand
the documentation completely before writing your own
inline assembly for gcc, and always test your understanding by
constructing and compiling simple test programs like these.

Comments ( 3 )
  • ux-admin Tuesday, December 6, 2005
    Very interesting article. Keep 'em coming!
    Question #1: wouldn't it have been simpler to write straight assembler .S code and assemble it with `as`, then link it with the rest of the C code, rather than having to play these games and dance around GCC?
    Question #2: does it make sense to fiddle with GCC in these ways when Sun Studio compilers are now free?
    Note: the most ideal case would be to have the famous Amiga ASM-One assembler IDE available on Solaris, which is of course next to impossible because the source code is in MC680xx assembler.
    Reference: http://www.euronet.nl/users/jdm/documents/asmone.html
  • Keith M Wesolowski Wednesday, December 7, 2005
    ux-admin, yes it is simpler to write assembly in a separate file and use the normal function call interface. But it's also much slower, especially since most of these functions are just a few instructions long. They really need to be inlined for performance reasons, especially the really simple functions like caller() and curthread(). Second, yes, this is worthwhile for several reasons. From a purely technical point of view, gcc catches bugs that neither cc nor lint does, and gcc will be a boon to anyone doing a port since Studio doesn't offer support for any other architectures. So using both compilers will help us keep the code portable and bug-free. From a philosophical point of view, Studio is free as in beer, but personally I'd rather not shut out people who want to use a Free compiler with a Free operating system. Perhaps at some point Studio will be open and this argument will be moot, but the technical merits would remain.
  • ux-admin Thursday, December 8, 2005
    Further improvements are being done on Sun Studio, right?