### IEEE-754: a skewed view

#### By Darryl Gove on Dec 11, 2006

IEEE-754 is the standard that covers floating point computation. It sets out standard data sizes, as well as special values like NaNs. In many ways it is a standard that enables floating point computation to be ported between systems without concern that one system's double precision will be another system's single precision.

From my perspective one of the frustrations of the standard is that it basically prohibits optimisation of the floating point portion of an application. As an example, even a calculation whose result is never used has to be performed in case the program is relying on a side-effect of the computation (a side-effect could be an overflow trap or something similar).

It's important to appreciate that some algorithms must be executed exactly as written - otherwise they will produce the wrong answer. It's also rather surprising to discover that some algorithms will not always produce the right answer, even when they are executed exactly as written.

An example of this is summing up a series of numbers. Imagine that the numbers happen to be sorted from largest to smallest, and the summation adds all the large numbers before it starts adding the small numbers. Once the sum gets beyond a certain value, the small numbers will be so small in comparison that they will not change the value of the summation - so the result could be significantly incorrect.

Now imagine flipping the order of the sequence, and adding the small numbers first, then the large ones. In this case the result will be more accurate because the summation will be adding small numbers to other small numbers before adding the large numbers. Professor Kahan discusses a number of these kinds of issues.
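The ordering effect is easy to reproduce. The following is a minimal sketch in double precision; the particular values (one number near 1e16 and a thousand 1.0s) are illustrative assumptions chosen so that the effect is exact and easy to check, not data from any real application. An explicit loop is used rather than Python's built-in `sum`, because newer Python versions may apply a more accurate summation algorithm to floats.

```python
# Illustrative values (an assumption, not from the original post):
# one large number and many small ones, in IEEE-754 double precision.
LARGE = 1e16          # doubles near 1e16 are spaced 2.0 apart
SMALL = [1.0] * 1000  # each 1.0 is only half of that spacing


def running_sum(values):
    """Add the values strictly left to right, as a simple loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


largest_first = running_sum([LARGE] + SMALL)   # large value, then the 1.0s
smallest_first = running_sum(SMALL + [LARGE])  # the 1.0s, then the large value

print(largest_first == LARGE)          # True: every 1.0 was rounded away
print(smallest_first - largest_first)  # 1000.0
```

With the large value first, each individual 1.0 is too small to move the running total, so all of them are lost. With the small values first, they accumulate to 1000.0, which is large enough to survive the final addition.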

So what about using -fast and floating point arithmetic that does not adhere to the standard? I guess this is where I have a skewed view of the whole thing.

Adherence to the standard does not guarantee the right answer. It does guarantee that what is calculated is what you asked for. However, as the summation example shows, what is programmed will not always give the most accurate result. Or, turning that point on its head, a program that contains floating point computation is unlikely to produce the correct result in all cases unless some technique has been applied to ensure that it will.
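One well-known technique of this kind is compensated (Kahan) summation, which carries a correction term for the low-order bits that each addition rounds away. The post does not say which techniques the author had in mind, so treat this as one possible illustration; the function names and test values below are assumptions made for the sketch.

```python
def plain_sum(values):
    """Straightforward left-to-right summation."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Compensated (Kahan) summation: track the rounding error of each add."""
    total = 0.0
    compensation = 0.0  # estimate of the low-order bits lost so far
    for v in values:
        y = v - compensation            # apply the pending correction
        t = total + y                   # this add may round away part of y
        compensation = (t - total) - y  # recover what was rounded away
        total = t
    return total


# Illustrative data (an assumption): a large value followed by small ones.
values = [1e16] + [1.0] * 1000

print(plain_sum(values) == 1e16)           # True: the 1.0s are all lost
print(kahan_sum(values) == 1e16 + 1000.0)  # True: the 1.0s are recovered
```

Note that this technique only works if the arithmetic is performed exactly as written: algebraically, `(t - total) - y` is zero, so a compiler that simplifies floating point expressions is free to remove the correction entirely. It is an instance of code that relies on the standard being followed.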

What this means to -fast is that, to my mind at least, the answer from a code which is compiled with floating point simplification is likely to be just as incorrect as the result from a code compiled without it. So if the results are going to be incorrect, they might as well be quick and incorrect rather than slow and incorrect.

This is rather a pessimistic view. Most codes don't suffer from this. In fact, just moving floating point computations to double precision typically moves most of the differences between the calculation with -fast and without it into the last few (insignificant) bits of the results, and 'solves' the problem.
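To give a feel for what a last-bit difference looks like, here is a small sketch in double precision that evaluates the same three-term sum with two different groupings. It assumes that reassociating additions is the kind of change floating point simplification might make; it does not claim to show what any particular compiler flag actually does.

```python
a, b, c = 0.1, 0.2, 0.3

as_written = (a + b) + c    # evaluated left to right
reassociated = a + (b + c)  # same algebraic expression, regrouped

print(as_written == reassociated)              # False
print(abs(as_written - reassociated) < 1e-15)  # True: the results differ only in the last bits
```

The two results differ, but only by an amount far below anything that would matter in a value near 0.6. A code whose final answer changes noticeably because of a difference this small is sensitive to rounding noise.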

There is an upside to this. If a code produces a significant difference in results with and without floating point simplification, it means one of two things:

- The code has been carefully crafted to take advantage of the IEEE-754 standard.
- The code is broken, and is sensitive to the last few insignificant bits of the computation (which probably means that the answer is nonsense).

And that's where my view is skewed. Basically if a code produces different results under -fast, then the code is almost certainly broken, and can't be trusted anyway.