By Darryl Gove on May 22, 2008
On SPARC there's a slight complication. The load and store instructions take a 13-bit signed immediate offset, which gives a range of -4096 to +4095. To use an offset outside that range, it is necessary to put the offset into a register and use that register to calculate the address.
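As a quick illustration of that limit, here is a minimal sketch of the range check a compiler has to make before it can encode an offset directly in a load or store. The names `fits_simm13`, `SIMM13_MIN` and `SIMM13_MAX` are made up for this example; they are not taken from any compiler source.

```c
#include <stdbool.h>

/* Range of a 13-bit signed immediate, as used by SPARC loads and stores.
 * The names here are hypothetical, chosen for this example only. */
enum { SIMM13_MIN = -4096, SIMM13_MAX = 4095 };

/* Returns true if `offset` can be encoded directly in the instruction,
 * false if it would first have to be placed in a register. */
bool fits_simm13(long offset)
{
    return offset >= SIMM13_MIN && offset <= SIMM13_MAX;
}
```

So an offset of 4095 can go straight into the instruction, while 4096 needs the extra step of loading the offset into a register.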
If the size of the local variables is less than 4KB, then a load or store instruction can use the frame pointer together with an offset to access the memory on the stack. If the stack frame is larger than 4KB, then it's possible to use the frame pointer to access memory in the upper 4KB, and the stack pointer to access memory in the lower 4KB, as this diagram shows:
    frame pointer -> top of stack
                     ^
                     |  Upper 4KB can be accessed
                     v  using offset + frame pointer
                     ^
                     |  Lower 4KB can be accessed
                     v  using offset + stack pointer
    stack pointer -> bottom of stack
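The choice between the two base registers can be sketched in a few lines of C. This is a simplified model rather than what any particular compiler does: it ignores details such as the stack bias and the register save area, and the names `base_reg` and `choose_base` are invented for the example. It assumes the immediate offset range is -4096 to +4095.

```c
/* Which register can reach a stack slot with an immediate offset.
 * All names are hypothetical, for illustration only. */
typedef enum { BASE_FP, BASE_SP, BASE_NEEDS_REGISTER } base_reg;

/* frame_size: bytes between the frame pointer and the stack pointer.
 * depth:      how far below the frame pointer the slot lives,
 *             with 0 < depth <= frame_size. */
base_reg choose_base(long frame_size, long depth)
{
    long fp_offset = -depth;              /* slot relative to frame pointer */
    long sp_offset = frame_size - depth;  /* slot relative to stack pointer */

    if (fp_offset >= -4096)
        return BASE_FP;             /* within the upper 4KB */
    if (sp_offset <= 4095)
        return BASE_SP;             /* within the lower 4KB */
    return BASE_NEEDS_REGISTER;     /* offset must be built in a register */
}
```

With an 8KB frame, a slot 100 bytes below the frame pointer is reached through the frame pointer, and a slot 8000 bytes below it is reached through the stack pointer. In a 20000 byte frame, a slot in the middle is out of range of both.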
The complication arises when temporary memory is allocated on the stack using alloca, and the size of the local variables exceeds 4KB. In this case it's not possible to just shift the stack pointer downwards. Doing so might push variables that were previously accessed through the stack pointer out of the 4KB offset range, or it would change the offset from the stack pointer at which variables are stored (by an amount which may only be known at runtime). Neither of these situations would be good.
Instead of just shifting the stack pointer, a slightly more complex operation has to be carried out. The memory gets allocated in the middle of the range, and the lower memory gets shifted (or copied) downwards. The end result is something like this:
    frame pointer -> top of stack
                     ^
                     |  Upper 4KB can be accessed
                     v  using offset + frame pointer
                     [Alloca'd memory]
                     ^
                     |  Lower 4KB can be accessed
                     v  using offset + stack pointer
    stack pointer -> bottom of stack
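To make the "shift the lower memory downwards" step concrete, here is a toy model in C. It is not the real __builtin_alloca: the stack is just a byte array, alignment is ignored, and the names `alloc_result`, `model_alloca` and `offsets_preserved` are invented for this sketch. What it is meant to show is that after the copy, everything in the lower region sits at the same offset from the (new) stack pointer as before.

```c
#include <stddef.h>
#include <string.h>

/* Toy model: `stack` is a byte array, higher index = higher address.
 * The lower region (the part reached through the stack pointer)
 * occupies [sp, sp + low_size). All names are hypothetical. */
typedef struct {
    size_t new_sp;  /* stack pointer after the allocation        */
    size_t block;   /* index of the first byte of the new block  */
} alloc_result;

/* Allocate n bytes by moving the lower region down by n bytes.
 * The gap left directly above it becomes the allocated block.
 * The caller must ensure n <= sp. */
alloc_result model_alloca(unsigned char *stack, size_t sp,
                          size_t low_size, size_t n)
{
    alloc_result r;
    r.new_sp = sp - n;
    memmove(stack + r.new_sp, stack + sp, low_size);  /* copy region down */
    r.block = r.new_sp + low_size;
    return r;
}

/* Returns 1 if every byte of the lower region is found at the same
 * offset from the stack pointer after the allocation, 0 otherwise. */
int offsets_preserved(void)
{
    unsigned char stack[64] = {0};
    size_t sp = 32, low_size = 8, n = 16;

    for (size_t i = 0; i < low_size; i++)
        stack[sp + i] = (unsigned char)(i + 1);

    alloc_result r = model_alloca(stack, sp, low_size, n);

    for (size_t i = 0; i < low_size; i++)
        if (stack[r.new_sp + i] != (unsigned char)(i + 1))
            return 0;

    /* The 16 byte block sits between the moved region and the upper part. */
    return r.new_sp == 16 && r.block == 24;
}
```

The cost is visible in the `memmove`: every allocation has to copy the whole lower region, which a plain stack pointer adjustment would not need to do.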
The routine that does this manipulation of memory is called __builtin_alloca. You can see in the code that it moves the stack pointer, and then has a copy loop to move the contents of the stack.
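For reference, this is the general shape of function that combines the two ingredients: more than 4KB of locals plus an alloca whose size is only known at runtime. The function name and the sizes are made up for this sketch, and whether a given compiler really calls __builtin_alloca for it depends on the compiler, target and flags.

```c
#include <alloca.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical example: 8KB of locals plus a runtime-sized alloca. */
size_t sum_with_scratch(size_t n)
{
    unsigned char big_locals[8192];       /* pushes the frame past 4KB */
    memset(big_locals, 1, sizeof big_locals);

    unsigned char *scratch = alloca(n);   /* size not known until runtime */
    memset(scratch, 2, n);

    size_t total = 0;
    for (size_t i = 0; i < sizeof big_locals; i++)
        total += big_locals[i];
    for (size_t i = 0; i < n; i++)
        total += scratch[i];
    return total;
}
```

Both the large array and the alloca'd block are used after the allocation, so the compiler has to keep both reachable at once.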
Unfortunately, the need to copy the data means that it takes longer to allocate memory. So if the function __builtin_alloca appears in a profile, the first thing to do is to see whether it's possible to reduce the amount of local variables/stack space needed for the routine.
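One possible way to shrink the frame, sketched here under the assumption that the bulk of the stack space is a single large work buffer, is to move that buffer to the heap so that the locals fit comfortably under 4KB. The function `checksum_small_frame` and the 8KB size are invented for this example, and heap allocation has its own cost, so it is worth measuring before and after.

```c
#include <alloca.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical routine: the 8KB work buffer lives on the heap, so the
 * frame itself only holds a few pointers and counters. */
long checksum_small_frame(const unsigned char *data, size_t n)
{
    unsigned char *work = malloc(8192);
    if (work == NULL)
        return -1;
    memset(work, 0, 8192);

    unsigned char *tmp = alloca(n);   /* still runtime-sized, small frame */
    memcpy(tmp, data, n);

    long total = 0;
    for (size_t i = 0; i < n; i++) {
        work[i % 8192] = tmp[i];      /* stand-in for real work */
        total += work[i % 8192];
    }

    free(work);
    return total;
}
```

Calling it on the three bytes 1, 2, 3 returns 6; the interesting part is the frame layout, not the arithmetic.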
As a footnote, take a look at the equivalent code for the x86 version of __builtin_alloca. The x86, being CISC, does not have the limit on the size of the offset that can be used. Hence the x86 code does not need the copy routine to move variables in the stack around.