### Transcendental Meditation

I got into a conversation with some folks who've been moving a large sophisticated image processing application to Java. They've been getting great performance numbers, much to the surprise of the C crowd in their shop.

With one exception: code that invokes sin() and cos() heavily is somewhat slower. They asked me why this was happening. I had a pretty good idea, but I checked with Joe Darcy, our local Floating Point God, and he had this to say:

For many years, the JDK on x86 platforms has used the hardware fsin/fcos x87 instructions in the range [-pi/4, pi/4], a range which encompasses about half of all representable floating-point values. Therefore, in that range the performance of the JDK's transcendental functions should be nearly the same as the performance of transcendental functions in C, C++, etc. that use those same fsin/fcos instructions. Benchmarks which focus on the trig performance of large arguments, such as almabench, present a skewed portrait of Java's trigonometric performance. The next question is: why don't we just use fsin/fcos over the entire floating-point range? The simple answer is that fsin/fcos can deliver answers that are arbitrarily wrong by the most straightforward way of measuring error in the result.

Every finite real number, no matter how large, has a well-defined value for sin/cos. Ideally, the floating-point result returned for sin/cos would be the representable floating-point number closest to the mathematically defined result for the floating-point input. A floating-point library having this property is called correctly rounded, which is equivalent to saying the library has an error bound less than or equal to 1/2 an ulp (unit in the last place). For sin/cos, writing a correctly rounded implementation that runs at a reasonable speed is still something of a research problem, so in practice platforms often use a library with a 1 ulp error bound instead, which means either of the floating-point numbers adjacent to the true result can be returned. This is the implementation criterion the Java math library has to meet.

The implementation challenge is that sin/cos are implemented using argument reduction, whereby any input is mapped to a corresponding input in the [-pi/4, pi/4] range by subtracting the nearest integer multiple of pi/2. Since pi is transcendental, this amounts to computing a remainder from division by a transcendental number, which is non-obvious. A few years after the x87 was designed, people figured out how to do this reduction as if by an exact value of pi. Instead, the x87 fsin/fcos use a particular approximation to pi, which effectively changes the period of the function and can lead to large errors outside [-pi/4, pi/4]. For example, the value of sine for the floating-point number Math.PI is around

1.2246467991473532E-16

while the computed value from fsin is

1.2246063538223773E-16

In other words, instead of getting the full 15-17 digit accuracy of double, the returned result is only correct to about 5 decimal digits. In terms of ulps, the error is about 1.64e11 ulps, over *a hundred billion* ulps. With some effort, I'm confident I could find results with the wrong sign, etc. There is a rationale which can justify this behavior; however, it was much more compelling before the argument reduction problem was solved.
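The numbers quoted above are easy to reproduce with nothing beyond `java.math.BigDecimal`. In this sketch, subtracting the exact value of the double `Math.PI` from a 50-digit decimal pi gives the true reduced argument of `Math.PI`; since sin(pi - d) is essentially d for tiny d, that difference is also the true value of sin(Math.PI). Dividing the discrepancy of the quoted fsin result by an ulp recovers the ~1.64e11 figure:

```java
import java.math.BigDecimal;

public class PiUlps {
    // pi to 50 decimal digits, far more than double's ~16
    static final BigDecimal PI =
        new BigDecimal("3.14159265358979323846264338327950288419716939937510");

    // The exact amount by which the double Math.PI falls short of pi.
    // new BigDecimal(double) converts the binary value exactly, and the
    // subtraction is exact, so only the final doubleValue() rounds.
    static double trueReducedPi() {
        return PI.subtract(new BigDecimal(Math.PI)).doubleValue();
    }

    // Error of the quoted x87 fsin result, measured in ulps.
    static double fsinErrorInUlps() {
        double correct = trueReducedPi();
        double fsin = 1.2246063538223773E-16;  // value quoted in the entry
        return (correct - fsin) / Math.ulp(correct);
    }

    public static void main(String[] args) {
        System.out.println("pi - Math.PI       = " + trueReducedPi());
        System.out.println("Math.sin(Math.PI)  = " + Math.sin(Math.PI));
        System.out.println("fsin error in ulps = " + fsinErrorInUlps());
    }
}
```

The first two printed values agree: Java's `Math.sin(Math.PI)` really does return the correctly reduced answer.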

This error has tragically become unfixable because of the compatibility requirements from one generation to the next. The fix for this problem was figured out quite a long time ago: in the excellent paper "The K5 Transcendental Functions" by T. Lynch, A. Ahmed, M. Schulte, T. Callaway, and R. Tisdale, a technique is described for doing argument reduction as if you had an infinitely precise value for pi. As far as I know, the K5 is the only x86-family CPU that did sin/cos accurately. AMD went back to being bit-for-bit compatible with the old x87 behavior, presumably because too many applications broke. Oddly enough, this is fixed on Itanium.

What we do in the JVM on x86 is moderately obvious: we range-check the argument, and if it's outside the range [-pi/4, pi/4] we do the precise range reduction by hand before calling fsin.
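To see why the reduction has to be done precisely, here is a small illustrative sketch (not the JDK's actual code): reducing a large argument with the *double* closest to 2*pi, via `Math.IEEEremainder`, silently injects that constant's rounding error (about 2.4e-16) once per period. By x = 1e15, roughly 1.6e14 periods, the reduced angle is off by a few hundredths of a radian:

```java
public class NaiveReduction {
    // Naive reduction: remainder of x by the double closest to 2*pi.
    // IEEEremainder itself is exact; the damage comes entirely from
    // 2*Math.PI differing from the true 2*pi by ~2.4e-16.
    static double naiveReduce(double x) {
        return Math.IEEEremainder(x, 2 * Math.PI);
    }

    public static void main(String[] args) {
        double x = 1e15;  // ~1.6e14 periods of sin
        double r = naiveReduce(x);
        // Math.sin does its own (accurate) reduction internally, so
        // comparing it against sin of the naively reduced angle shows
        // the accumulated phase error of ~0.04 radians.
        System.out.println("Math.sin(x)       = " + Math.sin(x));
        System.out.println("Math.sin(naive r) = " + Math.sin(r));
    }
}
```

The two printed values differ in the second decimal place, which is exactly the kind of error the JVM's by-hand reduction avoids.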

So Java is accurate, but slower. I've never been a fan of "fast, but wrong" when "wrong" is roughly random(). Benchmarks rarely test accuracy. `double sin(double theta) { return 0; }` would be a great benchmark-compatible implementation of sin(). For large values of theta, 0 would arguably be more accurate, since its absolute error is never greater than 1; fsin/fcos can have absolute errors as large as 2 (correct answer = 1; returned result = -1).

This is one of those areas where, no matter what we do, we're screwed.

(Wed Jul 27 10:41:59 PDT 2005)

I know this is stupid, and nobody reads comments to a crossposted blog, but anyway. The discussion above quotes in part a discussion in http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5005861. I happened to have just run a kind of investigative benchmark, and found that sin() gets very slow only for unusually large arguments (which makes sense given the algorithm being used). I compared strictfp sin() vs. plain sin() vs. a native sin() method (which immediately calls the FPU's FSIN). The problem with the native version is that HotSpot does not optimize the calls, and the transition between Java code and C code seems to take a lot of time. Certain argument values caused the non-strictfp results to differ from those calculated via strictfp.

| x | strictfp sin(x) | plain sin(x) | native sin(x) |
|---|---|---|---|
| 0.5 | 700 ms, 0.479425538604203 | 511 ms, 5.095750210681871E-18 | 3145 ms, 5.095750210681871E-18 |
| 1.2 | 1423 ms, 0.9320390859672263 | 2003 ms, 0.0 | 3726 ms, 0.0 |
| 3.5 | 1573 ms, -0.35078322768961984 | 1493 ms, 0.0 | 3024 ms, 0.0 |
| 10000.5 | 1761 ms, -0.7246894586135745 | 1882 ms, 0.0 | 3685 ms, 0.0 |
| 1.000000005E8 | 14370 ms, 0.6433740719829194 | 13630 ms, 0.0 | 3025 ms, 9.85878045867139E-14 |

Time here is the number of milliseconds per 10 million calls. In the strictfp column the second number is the strictfp value of sin(); in the other columns it is the difference between that column's result and the strictfp value.

Posted by Vlad Patryshev on July 27, 2005 at 03:40 PM PDT #
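For anyone who wants to repeat Vlad's experiment, a rough sketch of this kind of timing loop (a hypothetical harness, not his actual code) might look like:

```java
public class SinBench {
    // Time `iters` calls of Math.sin(x); returns elapsed milliseconds.
    // The accumulated sum is written into `sink` so the JIT cannot
    // discard the loop as dead code.
    static long timeSin(double x, int iters, double[] sink) {
        long t0 = System.nanoTime();
        double s = 0.0;
        for (int i = 0; i < iters; i++) {
            s += Math.sin(x);
        }
        sink[0] = s;
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        double[] sink = new double[1];
        // same sample arguments as the table above
        for (double x : new double[] {0.5, 1.2, 3.5, 10000.5, 1.000000005E8}) {
            long ms = timeSin(x, 10_000_000, sink);
            System.out.println("x=" + x + ": " + ms + " ms  (sum=" + sink[0] + ")");
        }
    }
}
```

A single un-warmed loop like this is only indicative; the relative jump for very large arguments is the signal, not the absolute numbers.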

It's even easy to test with gcc.

'gonzo' is a Powerbook (10.4.2) . 'gentoo' is a P3 box (linux):

```
gonzo:~ jim$ cat tt.c
#include <math.h>
#include <stdio.h>

int
main(int argc, char** argv)
{
    double d = M_PI;
    printf("sin(PI) is %.16le\n", sin(d));
    d = M_PI_4;
    printf("sin(PI/4) is %.16le\n", sin(d));
}

gonzo:~ jim$ gcc tt.c
gonzo:~ jim$ ./a.out
sin(PI) is 1.2246467991473532e-16
sin(PI/4) is 7.0710678118654757e-01
gonzo:~ jim$ scp tt.c gentoo:
tt.c                                          100%  196     0.2KB/s   00:00
gonzo:~ jim$ ssh gentoo
/usr/jim> gcc tt.c -lm
/usr/jim> ./a.out
sin(PI) is 1.2246063538223773e-16
sin(PI/4) is 7.0710678118654746e-01
```

Posted by Jim Thompson on July 27, 2005 at 04:03 PM PDT #

Anybody else find this site difficult to read with the background image that is used? I mean, it looks cool and all, but the image is too dark and the text (particularly if it is italicized) tends to get lost.

Posted by Ryan on July 28, 2005 at 12:08 AM PDT #

Yes, indeed... the background image makes these entries incredibly difficult to read.

Posted by Jess Sightler on July 28, 2005 at 03:31 AM PDT #

As another party in the ongoing discussion, I would like to add a couple of comments :-). The main one is that categorizing the fsin and fcos implementations as random, or completely wrong, does them a bit of a disservice. In the situation where this code is being benchmarked, the comparison is a visual one. The 6th significant digit being wrong is clearly not ideal, but in this case the image produced by the georectification is indistinguishable from that of the C++ implementation against which it is being tested. That is quite different from random numbers being kicked out, which would produce a very different image. As it is, no difference can be perceived between the georectified image from the fsin-based C++ app and the stricter Java implementation.

The comparison does show one very obvious difference, though: Java takes over twice as long as the C++ to produce an image with no visible differences. That is definitely noticeable. I believe that anyone considering adopting Java for a new project will run tests like these. They are not regular benchmarks, since the visible results are compared. In this case, the demo application demonstrates that Java takes twice as long as C++ on the same hardware to georectify an image. That result would be enough for most developers to simply walk away from the Java option. We dug a bit deeper to get to the core of the problem, and found that in all other ways Java was the superior option. However, that gap still exists, and given that this is an application where performance is never good enough, throwing away half of the performance is not an option.

The real problem is that as developers we have no recourse. Implementing pure JNI calls through to fsin and fcos did speed things up a little, but you pay the marshalling overheads. Short of the JIUL licensing there is no other way to fix the problem, and that cannot be re-distributed.

I think a line is crossed when developers are being protected from themselves. As root in Unix, one can type rm -rf / and have the machine start erasing the hard drive; this is the price you pay for choice and power. I believe that developers should be free to make their own informed choices about the balance of speed and accuracy they desire.

Posted by Dick Wall on July 28, 2005 at 03:37 AM PDT #

Yes, Dick, but to do so, they first need to understand the tradeoff. James' entry did a good job of explaining it. I lean towards his preference for "the right answer, slowly" over "the wrong answer, fast", since a developer who wants fast, approximate answers can always use something other than Java to get them.

Posted by Toby on July 28, 2005 at 04:51 AM PDT #

thanks for the info

Posted by KW Beh on July 28, 2005 at 10:33 AM PDT #

Agreed. I have no problem with people having all the facts, being told that these routines are unsafe, having surgeon-general health warnings on the packet, whatever it takes, but surely "use another language" is not an answer we should be pushing.

When I was lucky enough to talk to James Gosling at JavaOne, he mentioned in passing that he was disappointed at the adoption of Java in the math community. I think this may be a big part of the reason why. If we are sending out the message that to have even a choice about speed, developers must use another language, I believe that's what they will do, and that would be a shame.

We are not asking for the default to be changed, and we do not even necessarily need an implementation from Sun, but what we do need is a way to get at the faster routines where we need them, without paying the JNI overheads. I don't think that is unreasonable.

Posted by Dick Wall on July 28, 2005 at 11:39 AM PDT #

How hard is it to implement sin/cos in 100% pure Java? Isn't it just a Taylor series with a table for speed? (Or maybe I'm just naive.)

Posted by am on August 02, 2005 at 07:46 AM PDT #
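For what it's worth, the polynomial part of the question above really is that simple. A naive sketch (raw Taylor terms, rather than the minimax polynomials a real library such as fdlibm uses) could look like the following; the hard part, as the entry explains, is reducing an arbitrary argument into [-pi/4, pi/4] in the first place.

```java
public class TaylorSin {
    // Taylor series for sin on the already-reduced range [-pi/4, pi/4]:
    //   x - x^3/3! + x^5/5! - ...
    // Eight terms (through x^15/15!) leave a truncation error below
    // 1e-16 on this range, i.e. roughly double precision.
    static double taylorSin(double x) {
        double term = x, sum = x;
        for (int n = 1; n <= 7; n++) {
            term *= -x * x / ((2 * n) * (2 * n + 1));
            sum += term;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("taylorSin(0.5) = " + taylorSin(0.5));
        System.out.println("Math.sin(0.5)  = " + Math.sin(0.5));
    }
}
```

Outside the reduced range this polynomial diverges quickly, which is why the argument-reduction step, not the series, is where all the difficulty lives.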

As the lead of the team that is embarking on this huge endeavor to move a large scientific computing framework to Java/J2EE, I find it concerning to see the "use another language" answer. We did our due diligence and chose Java/J2EE for all the other benefits it provides, but it would be a shame to give all that up because of a theoretical argument. All native programs running on x86 platforms use these "wrong" functions; that is what people have been using for a long time, and it is not up to us to tell the customers what to do. When you are performing these computations a million times (once per pixel) while the user is panning over an image, being this much slower is not tolerable, and paying the JNI penalty is also not acceptable. So, as Dick has correctly pointed out, all we need is a way to tell the JVM to use the hardware for transcendental functions. Maybe create a library "NonStrictMath" that gives the same performance as C/C++ using the native platform optimizations.

Posted by Belai Beshah on August 03, 2005 at 02:03 AM PDT #

Comments are closed for this entry.
