Floating point multiply accumulate
By Darryl Gove on Jun 05, 2008
The SPARC64 VI processor supports floating point multiply accumulate instructions, also known as FMA or FMAC. These instructions take 3 input registers and one output register and perform the calculation:
dst = src1 * src2 + src3
The advantage of these instructions is that they perform the multiply and the add in the same time as it normally takes to do either a multiply or an add, so the rate at which floating point operations complete is potentially doubled. To see how they can do this, look at the following binary multiply example:
   1010
 *  111
   ----
   1010
  10100
+101000
-------
1000110
You can see that the multiply is really a sequence of adds. Adding one more addition into the process does not really make much difference.
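To make that concrete, here is a minimal C sketch of a shift-and-add multiply that folds one extra addend into the same accumulation. The function name shift_add_mul_acc is hypothetical, and the loop only illustrates the arithmetic. A real hardware multiplier sums its partial products in parallel rather than one per iteration.

```c
#include <stdint.h>

/* Hypothetical helper: computes a * b + c using only shifts and adds.
   The accumulator starts at c, so the extra addition is just one more
   term in the sum of partial products. */
uint64_t shift_add_mul_acc(uint32_t a, uint32_t b, uint32_t c)
{
    uint64_t acc = c;        /* the extra addend goes in first */
    uint64_t partial = a;    /* a shifted left by the current bit position */

    while (b != 0) {
        if (b & 1u)          /* this bit of b is set: add the partial product */
            acc += partial;
        partial <<= 1;
        b >>= 1;
    }
    return acc;
}
```

Calling shift_add_mul_acc(10, 7, 0) walks through the same three partial products as the example above (1010, 10100 and 101000) and returns 70, which is 1000110 in binary.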
However, we're really dealing with floating point numbers. So consider the usual sequence of operations when performing a multiplication followed by an addition:
temp = round(src1 * src2)
dst = round(src3 + temp)
The floating point numbers are typically computed in hardware with additional precision, then rounded to fit the register. A fused FMA is equivalent to the following operations:
dst = round(src1 * src2 + src3)
This performs rounding only at the end of the FMA operation. The single rounding may cause a difference in the least significant bits of the result when compared with the result from using two rounding operations. In theory, because the hardware does the computation in higher precision and rounds only once to the register precision, the result should be more accurate. It is also possible to support an unfused FMA operation, where the middle rounding step is preserved and the result is identical to the two-step code.
The SPARC64 VI processor supports fused FMA. As we've discussed, using this instruction may cause minor differences to the output, so you need to explicitly give permission to the compiler to generate it (by default the compiler avoids doing anything that would alter the results of FP computation). So the flags necessary to generate fused FMA instructions are:
A binary compiled to use the FMA instructions will not run on hardware that does not support these instructions. With that in mind it is also acceptable to specify the target processor in the flags, which avoids needing to specify the architecture: