### Floating point multiply accumulate

The SPARC64 VI processor supports floating point multiply accumulate instructions, also known as FMA or FMAC. These instructions take three input registers and one output register, and perform the calculation:

```
dest = src1 * src2 + src3
```

The advantage of these instructions is that they perform the multiply and add operations in the same time as it normally takes to do either a multiply or an add - so the issue rate of floating point instructions is potentially doubled. To see how they can do this, look at the following binary multiply example:

```
      1010
   *   111
  --------
      1010
     10100
  + 101000
  --------
   1000110
```

You can see that the multiply is really a sequence of adds. Adding one more addition into the process does not really make much difference.

However, we're really dealing with floating point numbers. So consider the usual sequence of operations when performing a multiplication followed by an addition:

```
temp = round(src1 * src2)
dst  = round(src3 + temp)
```

The floating point numbers are typically computed in hardware with additional precision, then rounded to fit the register. A fused FMA is equivalent to the following operations:

```
dst = round(src1 * src2 + src3)
```

This performs rounding only once, at the end of the FMA operation. The single rounding step may cause a difference in the least significant bits of the result when compared with the result of using two rounding steps. In theory, because the hardware does the computation in higher precision and then rounds only once to the target precision, the fused result should be more accurate. It is also possible to support an unfused FMA operation, which preserves the intermediate rounding step and produces results identical to the two-step code.

The SPARC64 VI processor supports fused FMA. As we've discussed, using this instruction may cause minor differences in the output, so you need to explicitly give the compiler permission to generate it (by default the compiler avoids doing anything that would alter the results of floating point computation). The flags necessary to generate fused FMA instructions are:

```
-xarch=sparcfmaf -xfma=fused
```

A binary compiled to use the FMA instructions will not run on hardware that does not support them. With that in mind, it is also acceptable to specify the target processor, which avoids having to specify the architecture separately:

```
-xtarget=sparc64vi -xfma=fused
```

It is sad that IEEE754 forbids this behavior. The JVM cannot use FMA, for instance.

And it is great that -xfma=fused works with any rounding mode, unlike the x86 -xvector option.

Posted by Marc on June 05, 2008 at 09:10 PM PDT #


Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
