## Thursday Feb 25, 2010

### Notions of Floating-Point Equality

Moving on from identity and equality of objects, different notions of equality are also surprisingly subtle in some numerical realms.

As comes up from time to time and is often surprising, the "`==`" operator defined by IEEE 754 and used by Java for comparing floating-point values (JLSv3 §15.21.1) is not an equivalence relation. Equivalence relations satisfy three properties, reflexivity (something is equivalent to itself), symmetry (if a is equivalent to b, b is equivalent to a), and transitivity (if a is equivalent to b and b is equivalent to c, then a is equivalent to c).

The IEEE 754 standard defines four possible mutually exclusive ordering relations between floating-point values:

• equal

• greater than

• less than

• unordered

A NaN (Not a Number) is unordered with respect to every floating-point value, including itself. This was done so that NaNs would not quietly slip by without due notice. Because (NaN == NaN) is false, the IEEE 754 "`==`" relation is not reflexive and therefore is not an equivalence relation.

An equivalence relation partitions a set into equivalence classes; each member of an equivalence class is "the same" as the other members of the class for the purposes of that equivalence relation. In terms of numerics, one would expect equivalent values to result in equivalent numerical results in all cases. Therefore, the size of the equivalence classes over floating-point values would be expected to be one; a number would only be equivalent to itself. However, in IEEE 754 there are two zeros, -0.0 and +0.0, and they compare as equal under `==`. For IEEE 754 addition and subtraction, the sign of a zero argument can at most affect the sign of a zero result. That is, if the sum or difference is not zero, a zero of either sign doesn't change the result. If the sum or difference is zero and one of the arguments is zero, the other argument must be zero too:

• -0.0 + -0.0 ⇒ -0.0

• -0.0 + +0.0 ⇒ +0.0

• +0.0 + -0.0 ⇒ +0.0

• +0.0 + +0.0 ⇒ +0.0

Therefore, under addition and subtraction, both signed zeros are equivalent. However, they are not equivalent under division since 1.0/-0.0 ⇒ -∞ but 1.0/+0.0 ⇒ +∞ and -∞ and +∞ are not equivalent.[1]
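The behavior of the two zeros can be observed directly. The following is a minimal sketch; the class name `SignedZeroDemo` and the helper `isNegativeZero` are made up for this illustration and are not part of any library.

```java
public class SignedZeroDemo {
    /** Hypothetical helper: true only for the zero whose sign bit is set. */
    static boolean isNegativeZero(double d) {
        return d == 0.0
            && Double.doubleToLongBits(d) == Double.doubleToLongBits(-0.0);
    }

    public static void main(String[] args) {
        double negZero = -0.0;
        double posZero = +0.0;

        // The two zeros compare as equal under ==.
        System.out.println(negZero == posZero);   // true

        // Under addition, only the sign of a zero result is affected.
        System.out.println(negZero + negZero);    // -0.0
        System.out.println(negZero + posZero);    // 0.0

        // Under division, the two zeros are distinguishable.
        System.out.println(1.0 / negZero);        // -Infinity
        System.out.println(1.0 / posZero);        // Infinity
    }
}
```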

Despite the rationales for the IEEE 754 specification to not define `==` as an equivalence relation, there are legitimate cases where one needs a true equivalence relation over floating-point values, such as when writing test programs, and cases where one needs a total ordering, such as when sorting. In my numerical tests I use a method that returns `true` for two floating-point values x and y if:
((x == y) &&
(if x and y are both zero they have the same sign)) ||
(x and y are both NaN)
Conveniently, this is computed simply by `(Double.compare(x, y) == 0)`. For sorting or a total order, the semantics of `Double.compare` are fine; NaN is treated as being the largest floating-point value, greater than positive infinity, and -0.0 < +0.0. That ordering is the total order used by `java.util.Arrays.sort(double[])`. In terms of semantics, it doesn't really matter where the NaNs are ordered with respect to other values as long as they are consistently ordered that way.[2]
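As a small illustration of the difference between `==` and the `Double.compare`-based equivalence, here is a sketch; the class name `FloatEquivalence` and the method name `equivalent` are chosen only for this example.

```java
import java.util.Arrays;

public class FloatEquivalence {
    /** Equivalence relation over doubles: NaN matches NaN, -0.0 differs from +0.0. */
    static boolean equivalent(double x, double y) {
        return Double.compare(x, y) == 0;
    }

    public static void main(String[] args) {
        double nan = Double.NaN;

        System.out.println(nan == nan);              // false: == is not reflexive
        System.out.println(equivalent(nan, nan));    // true

        System.out.println(0.0 == -0.0);             // true
        System.out.println(equivalent(0.0, -0.0));   // false

        // Arrays.sort(double[]) uses the same total order as Double.compare.
        double[] values = {Double.NaN, 1.0, Double.POSITIVE_INFINITY,
                           0.0, -0.0, Double.NEGATIVE_INFINITY};
        Arrays.sort(values);
        System.out.println(Arrays.toString(values));
        // [-Infinity, -0.0, 0.0, 1.0, Infinity, NaN]
    }
}
```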

These subtleties of floating-point comparison were also germane on the Project Coin mailing list last year; the definition of floating-point equality was discussed in relation to adding support for relational operations based on a type implementing the `Comparable` interface. That thread also broached the complexities involved in comparing BigDecimal values.

The `BigDecimal` class has a natural ordering that is inconsistent with equals; that is, for at least some inputs `bd1` and `bd2`,
`bd1.compareTo(bd2) == 0`
has a different boolean value than
`bd1.equals(bd2)`.[3]
In `BigDecimal`, the same numerical value can have multiple representations, such as (100 × 10⁰) versus (10 × 10¹) versus (1 × 10²). These are all "the same" numerically (`compareTo == 0`) but are not `equals` with each other. Such values are not equivalent under the operations supported by `BigDecimal`; for example (100 × 10⁰) has a scale of 0 while (1 × 10²) has a scale of -2.[4]
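The practical consequence shows up when such values are placed in collections. The sketch below uses a made-up class name, `BigDecimalCohort`, and two illustrative helpers; a `HashSet` relies on `equals` while a `TreeSet` relies on `compareTo`, so the two sets disagree on how many distinct elements they hold.

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.HashSet;
import java.util.TreeSet;

public class BigDecimalCohort {
    /** Number of distinct values as judged by equals (and hashCode). */
    static int distinctByEquals(BigDecimal... values) {
        return new HashSet<>(Arrays.asList(values)).size();
    }

    /** Number of distinct values as judged by the natural ordering (compareTo). */
    static int distinctByCompareTo(BigDecimal... values) {
        return new TreeSet<>(Arrays.asList(values)).size();
    }

    public static void main(String[] args) {
        BigDecimal a = new BigDecimal("100");   // unscaled value 100, scale 0
        BigDecimal b = new BigDecimal("1E+2");  // unscaled value 1, scale -2

        System.out.println(a.compareTo(b) == 0);        // true
        System.out.println(a.equals(b));                // false
        System.out.println(a.scale() + " " + b.scale()); // 0 -2

        System.out.println(distinctByEquals(a, b));     // 2
        System.out.println(distinctByCompareTo(a, b));  // 1
    }
}
```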

While subtle, the different notions of numerical equality each serve a useful purpose and knowing which notion is appropriate for a given task is an important factor in writing correct programs.

[1] There are two zeros in IEEE 754 because there are two infinities. Another way to extend the real numbers to include infinity is to have a single (unsigned) projective infinity. In such a system, there is only one conceptual zero. Early x87 chips before IEEE 754 was standardized had support for both signed (affine) and projective infinities. Each style of infinity is more convenient for some kinds of computations.

[2] Besides the equivalence relation offered by `Double.compare(x, y)`, another equivalence relation can be induced by either of the bitwise conversion routines, `Double.doubleToLongBits` or `Double.doubleToRawLongBits`. The former collapses all bit patterns that encode a NaN value into a single canonical NaN bit pattern, while the latter can let through a platform-specific NaN value. Implementation freedoms allowed by the original IEEE 754 standard have allowed different processor families to define different conventions for NaN bit patterns.

[3] I've at times considered whether it would be worthwhile to include an "`@NaturalOrderingInconsistentWithEquals`" annotation in the platform to flag the classes that have this quirk. Such an annotation could be used by various checkers to find potentially problematic uses of such classes in sets and maps.

[4] Building on wording developed for the `BigDecimal` specification under JSR 13, when I was editor of the IEEE 754 revision, I introduced several pieces of decimal-related terminology into the draft. Those terms include preferred exponent, analogous to the preferred scale from `BigDecimal`, and cohort, "The set of all floating-point representations that represent a given floating-point number in a given floating-point format." Put in terms of `BigDecimal`, the members of a cohort would be all the `BigDecimal` numbers with the same numerical value, but distinct pairs of scale (negation of the exponent) and unscaled value.

## Saturday Feb 20, 2010

### Everything Older is Newer Once Again

Catching up on writing about more numerical work from years past, the second article in a two-part series finished last year discusses some low-level floating-point manipulation methods I added to the platform over the course of JDKs 5 and 6. Previously, I published a blog entry reacting to the first part of the series.

JDK 6 enjoyed several numerics-related library changes. Constants for `MIN_NORMAL`, `MIN_EXPONENT`, and `MAX_EXPONENT` were added to the `Float` and `Double` classes. I also added to the `Math` and `StrictMath` classes the following methods for low-level manipulation of floating-point values:

• `public static double copySign(double magnitude, double sign)`
• `public static int getExponent(double d)`
• `public static double nextAfter(double start, double direction)`
• `public static double nextUp(double d)`
• `public static double scalb(double d, int scaleFactor)`

There are also overloaded methods for `float` arguments. In terms of the IEEE 754 standard from 1985, the methods above provide the core functionality of the recommended functions. In terms of the 2008 revision to IEEE 754, analogous functions are integrated throughout different sections of the document.

While a student at Berkeley, I wrote a tech report on algorithms I developed for an earlier implementation of these methods, an implementation written many years ago when I was a summer intern at Sun. The implementation of the recommended functions in the JDK is a refinement of the earlier work, a refinement that simplified code, added extensive and effective unit tests, and sported better performance in some cases. In part the simplifications came from not attempting to accommodate IEEE 754 features not natively supported in the Java platform, in particular rounding modes and sticky flags.

The primary purpose of these methods is to assist in the development of math libraries in Java, such as the recent pure Java implementation of floor and ceil (6908131). This expected use-case drove certain API differences with the functions sketched by IEEE 754. For example, the `getExponent` method simply returns the unbiased value stored in the exponent field of a floating-point value rather than doing additional processing, such as computing the exponent needed to normalize a subnormal number, additional processing called for in some flavors of the 754 `logb` operation. Such additional functionality can actually slow down math libraries since libraries may not benefit from the additional filtering and may actually have to undo it.
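For readers who have not used these methods, here is a brief sketch of how they behave on a few simple inputs, including the `getExponent` behavior on a subnormal value described above. The class name `RecommendedFunctionsDemo` is made up for this example; the methods themselves are the `java.lang.Math` ones listed earlier.

```java
public class RecommendedFunctionsDemo {
    public static void main(String[] args) {
        // copySign: magnitude of the first argument, sign of the second.
        System.out.println(Math.copySign(3.0, -0.0));     // -3.0

        // getExponent: unbiased exponent field; 8.0 is 1.0 x 2^3.
        System.out.println(Math.getExponent(8.0));        // 3

        // For zero and subnormal arguments no normalization is attempted;
        // the result is Double.MIN_EXPONENT - 1.
        System.out.println(Math.getExponent(Double.MIN_VALUE)); // -1023

        // nextUp and nextAfter: adjacent floating-point values.
        System.out.println(Math.nextUp(1.0));             // 1.0000000000000002
        System.out.println(Math.nextAfter(1.0, 0.0));     // 0.9999999999999999

        // scalb: multiply by a power of two, here 1.5 x 2^4.
        System.out.println(Math.scalb(1.5, 4));           // 24.0
    }
}
```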

The `Math` and `StrictMath` specifications of `copySign` have a small difference: the `StrictMath` version always treats NaNs as having a positive sign (a sign bit of zero) while the `Math` version does not impose this requirement. The IEEE standard does not ascribe a meaning to the sign bit of a NaN, and different processors have different conventions for NaN representations and how they propagate. However, if the source argument is not a NaN, the two `copySign` methods will produce equivalent results. Therefore, even if being used in a library where the results need to be completely predictable, the faster `Math` version of `copySign` can be used as long as the source argument is known to be numerical.

The recommended functions can also be used to solve a little floating-point puzzle: generating the interesting limit values of a floating-point format just starting with constants for `0.0` and `1.0` in that format:

• `NaN` is `0.0/0.0`.

• `POSITIVE_INFINITY` is `1.0/0.0`.

• `MAX_VALUE` is `nextAfter(POSITIVE_INFINITY, 0.0)`.

• `MIN_VALUE` is `nextUp(0.0)`.

• `MIN_NORMAL` is `MIN_VALUE/(nextUp(1.0)-1.0)`.
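The puzzle's answers can be checked against the platform constants. This is a sketch for the `double` format only; the class name `LimitValues` and its method names are made up for the example.

```java
public class LimitValues {
    // The only two constants the puzzle allows.
    static final double ZERO = 0.0;
    static final double ONE = 1.0;

    static double nan()              { return ZERO / ZERO; }
    static double positiveInfinity() { return ONE / ZERO; }
    static double maxValue()         { return Math.nextAfter(positiveInfinity(), ZERO); }
    static double minValue()         { return Math.nextUp(ZERO); }

    // nextUp(1.0) - 1.0 is the spacing of doubles just above 1.0 (2^-52);
    // dividing the smallest subnormal by it gives the smallest normal value.
    static double minNormal()        { return minValue() / (Math.nextUp(ONE) - ONE); }

    public static void main(String[] args) {
        System.out.println(Double.isNaN(nan()));
        System.out.println(positiveInfinity() == Double.POSITIVE_INFINITY);
        System.out.println(maxValue() == Double.MAX_VALUE);
        System.out.println(minValue() == Double.MIN_VALUE);
        System.out.println(minNormal() == Double.MIN_NORMAL);
    }
}
```

Each line printed by `main` should be `true`.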

## Friday Feb 12, 2010

### Finding a bug in FDLIBM pow

Writing up a piece of old work for some more Friday fun: an example of testing where the failures are likely to be, which led to my independent discovery of a bug in the FDLIBM `pow` function, one of only two bugs fixed in FDLIBM 5.3. Even back when this bug was fixed for Java some time ago (5033578), the FDLIBM library was well-established, widely used in the Java platform and elsewhere, and already thoroughly tested, so I was quite proud my tests found a new problem. The next most recent change to the `pow` implementation was eleven years prior to the fix in 5.3.

The specification for `Math.pow` is involved, with over two dozen special cases listed. When setting out to write tests for this method, I re-expressed the specification in a tabular form to understand what was going on. After a few iterations reminiscent of tweaking a Karnaugh map, the table below was the result.

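Special Cases for FDLIBM `pow` and {`Math`, `StrictMath`}`.pow`: value of x^y, with x down the rows and y across the columns

| x \ y | –∞ | –∞ < y < –1 | –1 | –1 < y < 0 | –0.0 | +0.0 | 0 < y < 1 | 1 | 1 < y < +∞ | +∞ | NaN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| –∞ | +0.0 | f2(y) | f2(y) | f2(y) | 1.0 | 1.0 | f1(y) | f1(y) | f1(y) | +∞ | NaN |
| –∞ < x < –1 | +0.0 | f3(x, y) | f3(x, y) | f3(x, y) | 1.0 | 1.0 | f3(x, y) | f3(x, y) | f3(x, y) | +∞ | NaN |
| –1 | NaN† | f3(x, y) | f3(x, y) | f3(x, y) | 1.0 | 1.0 | f3(x, y) | f3(x, y) | f3(x, y) | NaN† | NaN |
| –1 < x < 0 | +∞ | f3(x, y) | f3(x, y) | f3(x, y) | 1.0 | 1.0 | f3(x, y) | f3(x, y) | f3(x, y) | +0.0 | NaN |
| –0.0 | +∞ | f1(y) | f1(y) | f1(y) | 1.0 | 1.0 | f2(y) | f2(y) | f2(y) | +0.0 | NaN |
| +0.0 | +∞ | +∞ | +∞ | +∞ | 1.0 | 1.0 | +0.0 | +0.0 | +0.0 | +0.0 | NaN |
| 0 < x < 1 | +∞ | | | | 1.0 | 1.0 | | x | | +0.0 | NaN |
| 1 | NaN† | | | | 1.0 | 1.0 | | 1.0 | | NaN† | NaN |
| 1 < x < +∞ | +0.0 | | | | 1.0 | 1.0 | | x | | +∞ | NaN |
| +∞ | +0.0 | +0.0 | +0.0 | +0.0 | 1.0 | 1.0 | +∞ | +∞ | +∞ | +∞ | NaN |
| NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN |

f1(y) = isOddInt(y) ? –∞ : +∞;
f2(y) = isOddInt(y) ? –0.0 : +0.0;
f3(x, y) = isEvenInt(y) ? |x|^y : (isOddInt(y) ? –|x|^y : NaN);

† Defined to be +1.0 in C99, see §F.9.4.4 of the C99 specification. Large magnitude finite floating-point numbers are all even integers (since the precision of a typical floating-point format is much less than its exponent range, a large number will be an integer times the base raised to a power). Therefore, by the reasoning of the C99 committee, `pow(-1.0, ∞)` was like `pow(-1.0, Unknown large even integer)` so the result was defined to be `1.0` instead of `NaN`.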

The range of arguments in each row and column is partitioned into eleven categories: ten categories of non-NaN values together with NaN (Not a Number). Some combinations of x and y arguments are covered by multiple clauses of the specification. A few helper functions are defined to simplify the presentation. As noted in the table, a cross-platform wrinkle is that the C99 specification, which came out after Java was first released, defined certain special cases differently than in FDLIBM and Java's `Math.pow`.
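The helper functions used in the table can be written out in Java. The following is one possible sketch, not the code from the actual regression test; the class name `PowSpecialCaseHelpers` and the extra helper `isFiniteInt` are assumptions made for this illustration. Note that `f3` leans on `Math.pow` itself for the magnitude, so it describes the expected sign handling rather than an independent way to compute the result.

```java
public class PowSpecialCaseHelpers {
    /** Hypothetical helper: true if y is finite and has no fractional part. */
    static boolean isFiniteInt(double y) {
        return !Double.isNaN(y) && !Double.isInfinite(y) && y == Math.rint(y);
    }

    /** The % operator on doubles is exact, so odd integers leave a remainder of +/-1.0. */
    static boolean isOddInt(double y) {
        return isFiniteInt(y) && Math.abs(y % 2.0) == 1.0;
    }

    static boolean isEvenInt(double y) {
        return isFiniteInt(y) && (y % 2.0) == 0.0;
    }

    static double f1(double y) {
        return isOddInt(y) ? Double.NEGATIVE_INFINITY : Double.POSITIVE_INFINITY;
    }

    static double f2(double y) {
        return isOddInt(y) ? -0.0 : +0.0;
    }

    static double f3(double x, double y) {
        double magnitude = Math.pow(Math.abs(x), y);
        return isEvenInt(y) ? magnitude : (isOddInt(y) ? -magnitude : Double.NaN);
    }

    public static void main(String[] args) {
        // x = -infinity with a positive odd integer y falls under f1.
        System.out.println(Math.pow(Double.NEGATIVE_INFINITY, 3.0) == f1(3.0));
        // x = -0.0 with a positive odd integer y falls under f2.
        System.out.println(Double.compare(Math.pow(-0.0, 3.0), f2(3.0)) == 0);
        // 2^53 - 1 is odd, while 2^53 and everything larger is even.
        System.out.println(isOddInt(0x1.fffffffffffffp52) + " " + isEvenInt(0x1.0p53));
    }
}
```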

A regression test based on this tabular representation of `pow` special cases is `jdk/test/java/lang/Math/PowTests.java`. The test makes sure each interesting combination in the table is probed at least once. Some combinations receive multiple probes. When an entry represents a range, the exact endpoints of the range are tested; in addition, other interesting interior points are tested too. For example, for the range 1 < x < +∞ the individual points tested are:

```
+1.0000000000000002, // nextAfter(+1.0, +oo)
+1.0000000000000004,
+2.0,
+Math.E,
+3.0,
+Math.PI,
-(double)Integer.MIN_VALUE - 1.0,
-(double)Integer.MIN_VALUE,
-(double)Integer.MIN_VALUE + 1.0,
(double)Integer.MAX_VALUE + 4.0,
(double) ((1L<<53)-1L),
(double) ((1L<<53)),
(double) ((1L<<53)+2L),
-(double)Long.MIN_VALUE,
Double.MAX_VALUE,
```

Besides the endpoints, the interesting interior points include points worth checking because of transitions either in the IEEE 754 `double` format or a 2's complement integer format.

Inputs that used to fail under this testing cover a range of severities, from the almost always numerically benign error of returning a wrongly signed zero, to returning a zero when the result should be a finite nonzero value, to returning infinity for a finite result, to even returning a wrongly signed infinity!

### Selected Failing Inputs

```
Failure for StrictMath.pow(double, double):
For inputs -0.5                   (-0x1.0p-1) and
9.007199254740991E15  (0x1.fffffffffffffp52)
expected   -0.0                   (-0x0.0p0)
got         0.0                   (0x0.0p0).

Failure for StrictMath.pow(double, double):
For inputs -0.9999999999999999    (-0x1.fffffffffffffp-1) and
9.007199254740991E15  (0x1.fffffffffffffp52)
expected   -0.36787944117144233   (-0x1.78b56362cef38p-2)
got        -0.0                   (-0x0.0p0).

Failure for StrictMath.pow(double, double):
For inputs -1.0000000000000004    (-0x1.0000000000002p0) and
9.007199254740994E15  (0x1.0000000000001p53)
expected  54.598150033144236      (0x1.b4c902e273a58p5)
got       0.0                     (0x0.0p0).

Failure for StrictMath.pow(double, double):
For inputs -0.9999999999999998    (-0x1.ffffffffffffep-1) and
9.007199254740992E15  (0x1.0p53)
expected    0.13533528323661267   (0x1.152aaa3bf81cbp-3)
got         0.0                   (0x0.0p0).

Failure for StrictMath.pow(double, double):
For inputs -0.9999999999999998    (-0x1.ffffffffffffep-1) and
-9.007199254740991E15  (-0x1.fffffffffffffp52)
expected   -7.38905609893065      (-0x1.d8e64b8d4ddaep2)
got        -Infinity              (-Infinity).

Failure for StrictMath.pow(double, double):
For inputs -3.0                   (-0x1.8p1) and
9.007199254740991E15  (0x1.fffffffffffffp52)
expected   -Infinity              (-Infinity)
got        Infinity               (Infinity).
```

The code changes to address the bug were fairly simple; corrections were made to extracting components of the floating-point inputs and sign information was propagated properly.
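Two of the failing inputs above involve only the sign of a zero or an infinity, which makes them easy to re-check on a current JDK. The sketch below is not the actual regression test; the class name `PowRegressionProbe` and its helpers are made up for the example. It compares results with `Double.compare` so that a wrongly signed zero is detected.

```java
public class PowRegressionProbe {
    /** Equivalence check that distinguishes -0.0 from +0.0. */
    static boolean sameValue(double expected, double actual) {
        return Double.compare(expected, actual) == 0;
    }

    /** Runs one probe and reports whether StrictMath.pow gives the expected value. */
    static boolean probe(double x, double y, double expected) {
        double actual = StrictMath.pow(x, y);
        boolean ok = sameValue(expected, actual);
        System.out.printf("pow(%a, %a): expected %a, got %a (%s)%n",
                          x, y, expected, actual, ok ? "pass" : "FAIL");
        return ok;
    }

    public static void main(String[] args) {
        double bigOdd = 0x1.fffffffffffffp52;  // 2^53 - 1, the largest odd double

        // Inputs and expected values taken from the failure listing above.
        probe(-0.5, bigOdd, -0.0);
        probe(-3.0, bigOdd, Double.NEGATIVE_INFINITY);
    }
}
```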

Even expertly written software can have errors and even long-used software can have unexpected problems. Estimating how often this bug in FDLIBM caused an issue is difficult; while the errors could be egregious, the inputs needed to elicit the problem were arguably unusual (even though perfectly valid mathematically). Thorough testing is a key aspect of assuring the quality of numerical software; it is also helpful for end-users to be able to examine the output of their programs to help notice problems.

## Wednesday Dec 03, 2008

One of the more obscure language changes included back in JDK 5 was the addition of hexadecimal floating-point literals to the platform. As the name implies, hexadecimal floating-point literals allow literals of the float and double types to be written primarily in base 16 rather than base 10. The underlying primitive types use binary floating-point so a base 16 literal avoids various decimal ↔ binary rounding issues when there is a need to specify a floating-point value with a particular representation.

The conversion rule for decimal strings into binary floating-point values is that the binary floating-point value nearest the exact decimal value must be returned. When converting from binary to decimal, the rule is more subtle: the shortest string that allows recovery of the same binary value in the same format is to be used. While these rules are sensible, surprises are possible from the differing bases used for storage and display. For example, the numerical value 1/10 is not exactly representable in binary; it is a binary repeating fraction just as 1/3 is a repeating fraction in decimal. Consequently, the numerical values of 0.1f and 0.1d are not the same; the exact numerical value of the comparatively low precision float literal 0.1f is
0.100000001490116119384765625
and the shortest string that will convert to this value as a double is
0.10000000149011612.
This in turn differs from the exact numerical value of the higher precision double literal 0.1d,
0.1000000000000000055511151231257827021181583404541015625. Therefore, based on decimal input, it is not always clear what particular binary numerical value will result.
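These exact values can be reproduced with `BigDecimal`, whose `BigDecimal(double)` constructor converts a binary floating-point value to its exact decimal expansion. The class name `DecimalViews` and its helper methods are made up for this sketch.

```java
import java.math.BigDecimal;

public class DecimalViews {
    /** Exact decimal expansion of a double value. */
    static String exact(double d) {
        return new BigDecimal(d).toString();
    }

    /** Shortest decimal string that round-trips as a double. */
    static String shortest(double d) {
        return Double.toString(d);
    }

    public static void main(String[] args) {
        // Widening 0.1f to double is exact, so this is the exact value of the float.
        System.out.println(exact(0.1f));
        // 0.100000001490116119384765625

        // The same value printed with the shortest-string rule for doubles.
        System.out.println(shortest(0.1f));
        // 0.10000000149011612

        // The exact value of the double literal.
        System.out.println(exact(0.1d));
        // 0.1000000000000000055511151231257827021181583404541015625
    }
}
```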

Since floating-point arithmetic is almost always approximate, dealing with some rounding error on input and output is usually benign. However, in some cases it is important to exactly specify a particular floating-point value. For example, the Java libraries include constants for the largest finite double value, numerically equal to (2 − 2⁻⁵²)·2¹⁰²³, and the smallest nonzero value, numerically equal to 2⁻¹⁰⁷⁴. In such cases there is only one right answer and these particular limits are derived from the binary representation details of the corresponding IEEE 754 double format. Just based on those binary limits, it is not immediately obvious how to construct a minimal length decimal string literal that will convert to the desired values.

Another way to create floating-point values is to use a bitwise conversion method, such as `doubleToLongBits` and `longBitsToDouble`. However, even for numerical experts this interface is inhumane since all the gory bit-level encoding details of IEEE 754 are exposed and values created in this fashion are not regarded as constants. Therefore, for some use cases it is helpful to have a textual representation of floating-point values that is simultaneously human readable, clearly unambiguous, and tied to the binary representation in the floating-point format. Hexadecimal floating-point literals are intended to have these three properties, even if the readability is only in comparison to the alternatives!

Hexadecimal floating-point literals originated in C99 and were later included in the recent revision of the IEEE 754 floating-point standard. The grammar for these literals in Java is given in JLSv3 §3.10.2:

HexFloatingPointLiteral:

HexSignificand BinaryExponent FloatTypeSuffix(opt)

This readily maps to the sign, significand, and exponent fields defining a finite floating-point value: sign `0x` significand `p` exponent. This syntax allows the literal

0x1.8p1

to be used to represent the value 3; 1.8 (hex) × 2¹ = 1.5 (decimal) × 2 = 3. More usefully, the maximum value of
(2 − 2⁻⁵²)·2¹⁰²³ can be written as
0x1.fffffffffffffp1023
and the minimum value of
2⁻¹⁰⁷⁴ can be written as
0x1.0P-1074 or 0x0.0000000000001P-1022, which are clearly mappable to the various fields of the floating-point representation while being much more scrutable than a raw bit encoding.
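The literals above can be checked against the corresponding `Double` constants. A short sketch, with a made-up class name `HexLiterals`; the library methods used are `Double.toHexString` and `Double.parseDouble`.

```java
public class HexLiterals {
    // Hexadecimal literals are compile-time constants, unlike bitwise conversions.
    static final double THREE = 0x1.8p1;
    static final double MAX = 0x1.fffffffffffffp1023;
    static final double MIN_A = 0x1.0P-1074;
    static final double MIN_B = 0x0.0000000000001P-1022;

    public static void main(String[] args) {
        System.out.println(THREE == 3.0);                 // true
        System.out.println(MAX == Double.MAX_VALUE);      // true
        System.out.println(MIN_A == Double.MIN_VALUE);    // true
        System.out.println(MIN_B == Double.MIN_VALUE);    // true

        // The library can produce and consume the same syntax.
        System.out.println(Double.toHexString(3.0));               // 0x1.8p1
        System.out.println(Double.toHexString(Double.MAX_VALUE));  // 0x1.fffffffffffffp1023
        System.out.println(Double.toHexString(Double.MIN_VALUE));  // 0x0.0000000000001p-1022
        System.out.println(Double.parseDouble("0x1.8p1"));         // 3.0
    }
}
```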

Retroactively reviewing the possible steps needed to add hexadecimal floating-point literals to the language:

1. Update the Java Language Specification: As a purely syntactic change, only a single section of the JLS had to be updated to accommodate hexadecimal floating-point literals.

2. Implement the language change in a compiler: Just the lexer in javac had to be modified to recognize the new syntax; javac used new platform library methods to do the actual numeric conversion.

3. Add any essential library support: While not strictly necessary, the usefulness of the literal syntax is increased by also recognizing the syntax in Double.parseDouble and similar methods and outputting the syntax with Double.toHexString; analogous support was added in corresponding Float methods. In addition the new-in-JDK 5 Formatter "printf" facility included the %a format for hexadecimal floating-point.

4. Write tests: Regression tests (under test/java/lang/Double in the JDK workspace/repository) were included as part of the library support (4826774).

5. Update the Java Virtual Machine Specification: No JVMS changes were needed for this feature.

6. Update the JVM and other tools that consume classfiles: As a Java source language change, classfile-consuming tools were not affected.

7. Update the Java Native Interface (JNI): Likewise, new literal syntax was orthogonal to calling native methods.

8. Update the reflective APIs: Some of the reflective APIs in the platform came after hexadecimal floating-point literals were added; however, only an API modeling the syntax of the language, such as the tree API might need to be updated for this kind of change.

9. Update serialization support: New literal syntax has no impact on serialization.

10. Update the javadoc output: One possible change to javadoc output would have been supplementing the existing entries for floating-point fields in the constant fields values page with hexadecimal output; however, that change was not done.

In terms of language changes, adding hexadecimal floating-point literals is about as simple as a language change can be: only straightforward and localized changes were needed to the JLS and compiler, and the library support was clearly separated. Hexadecimal floating-point literals aren't applicable to that many programs, but when they can be used, they have extremely high utility in allowing the source code to clearly reflect the precise numerical intentions of the author.

## Wednesday Oct 29, 2008

### Everything Old is New Again

I was heartened to recently come across the article Java's new math, Part 1: Real numbers which detailed some of the additions I made to Java's math libraries over the years in JDK 5 and 6, including hyperbolic trigonometric functions (sinh, cosh, tanh), cube root, and base-10 log.

A few comments on the article itself: I would describe java.lang.StrictMath as java.lang.Math's fussy twin rather than evil twin. The availability of the StrictMath class allows developers who need cross-platform reproducible results from the math library to get them. Just because floating-point arithmetic is an approximation to real arithmetic doesn't mean it shouldn't be predictable! There are non-contrived circumstances where numerical programs are helped by having such strong reproducibility available. For example, to avoid unwanted communication overhead, certain parallel decomposition algorithms rely on different nodes being able to independently compute consistent numerical answers.

While the java.lang.Math class is not constrained to use the particular FDLIBM algorithms required by StrictMath, any valid Math class implementation still must meet the stated quality of implementation criteria for the methods. The criteria usually include a low worst-case relative error, as measured in ulps (units in the last place), and semi-monotonicity: whenever the mathematical function is non-decreasing, so is the floating-point approximation; likewise, whenever the mathematical function is non-increasing, so is the floating-point approximation.
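To make the two criteria concrete, here is a rough spot-check sketch using `cbrt`, one of the methods mentioned above. It is only an illustration, not the testing approach used for the JDK; the class name `QualityCriteriaSketch` and both helper methods are made up. A spot check of this kind can find violations but cannot prove their absence.

```java
public class QualityCriteriaSketch {
    /**
     * Walks through 'steps' adjacent doubles starting at 'start' and checks that
     * Math.cbrt never decreases, since the true cube root is increasing.
     */
    static boolean cbrtNonDecreasing(double start, int steps) {
        double x = start;
        double previous = Math.cbrt(x);
        for (int i = 0; i < steps; i++) {
            x = Math.nextUp(x);
            double current = Math.cbrt(x);
            if (current < previous) {
                return false;
            }
            previous = current;
        }
        return true;
    }

    /** Distance between two results, measured in ulps of the reference value. */
    static double ulpsApart(double value, double reference) {
        return Math.abs(value - reference) / Math.ulp(reference);
    }

    public static void main(String[] args) {
        System.out.println(cbrtNonDecreasing(1.0, 100_000));
        // Math.cbrt may differ from StrictMath.cbrt, but only by a small number of ulps.
        System.out.println(ulpsApart(Math.cbrt(10.0), StrictMath.cbrt(10.0)));
    }
}
```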

Simply adding more FDLIBM methods to the platform was quite easy to do; much of the effort for the math library additions went toward developing new tests, both to verify that the general quality of implementation criteria were being met and to verify that the particular algorithms were being used to implement the StrictMath methods. I'll discuss the techniques I used to develop those tests in a future blog entry.

## Thursday Mar 01, 2007

### Norms: How to Measure Size

At times it is useful to summarize a set of values, say a vector of real numbers, as a single number representing the set's size. For example, distilling benchmark subcomponent scores into an overall score. One way to do this is to use a norm. Mathematically, a norm maps from a vector V of a given number of elements to a real number length such that the following properties hold:

• norm(V) ≥ 0 for all V and norm(V) = 0 if and only if V = 0 (positive definiteness)
• norm(c · V) = abs(c) · norm(V) for real constant c (homogeneity)
• norm(U + V) ≤ norm(U) + norm(V) (the triangle inequality)

There are a few commonly used norms:

• 1-norm: sum of the absolute values (Manhattan length)
• 2-norm: square root of the sum of the squares (Euclidean length)
• ∞-norm: largest absolute value

The first two norms are instances of p-norms. A p-norm adds up the result of raising the absolute value of each vector component to the pth power (squaring, or cubing, etc.) and then takes the pth root of the sum. The ∞-norm is the limit as p goes to infinity.

Given multiple possible norms, which one should be used? The 2-norm is often easier to work with since it is a differentiable function of the vector components, unlike the 1-norm and ∞-norm. On the other hand, the ∞-norm captures the worst-case behavior. Sometimes one norm is easier to compute than the others. Another norm might make an error analysis more tractable. For vectors, in some sense it doesn't matter which norm is used because any two norms, norm_a and norm_b, are equivalent in the following sense: there are positive constants c_1 and c_2 such that

c_1 · norm_a(V) ≤ norm_b(V) ≤ c_2 · norm_a(V)

This means that if one norm is tending toward zero, all other norms are tending toward zero too. For example, commonly in numerical linear algebra there is an iterative process that terminates once the norm of the error is small enough. Concretely, for vectors of size n, the common norms are related as follows:

norm_2(V) ≤ norm_1(V) ≤ sqrt(n) · norm_2(V)
norm_∞(V) ≤ norm_2(V) ≤ sqrt(n) · norm_∞(V)
norm_∞(V) ≤ norm_1(V) ≤ n · norm_∞(V)

So to guarantee that the 1-norm is less than epsilon, it is enough to show that the 2-norm is less than epsilon/sqrt(n).

However, in other ways the different norms are not equivalent; the norms can give different answers on the relative size of different vectors. Consider the three vectors A, B, and C:

A = [5, 0, 0]
B = [1, 3, 4]
C = [8/3, 8/3, 3]

| Vector | 1-norm | 2-norm | ∞-norm |
|---|---|---|---|
| A | 5 | 5 | 5 |
| B | 8 | ≈5.1 | 4 |
| C | ≈8.3 | ≈4.8 | 3 |
| Biggest Vector | C | B | A |

Each vector is considered the largest under one of the norms.
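The numbers in the table can be reproduced with a few lines of code. This is a straightforward sketch with a made-up class name, `Norms`; the 2-norm here is computed naively, so it could overflow for very large components where a scaled computation would not.

```java
public class Norms {
    /** 1-norm: sum of the absolute values. */
    static double norm1(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += Math.abs(x);
        }
        return sum;
    }

    /** 2-norm: square root of the sum of the squares (naive, unscaled). */
    static double norm2(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    /** Infinity-norm: largest absolute value. */
    static double normInf(double[] v) {
        double max = 0.0;
        for (double x : v) {
            max = Math.max(max, Math.abs(x));
        }
        return max;
    }

    public static void main(String[] args) {
        double[][] vectors = {
            {5, 0, 0},              // A
            {1, 3, 4},              // B
            {8.0 / 3, 8.0 / 3, 3},  // C
        };
        String[] names = {"A", "B", "C"};
        for (int i = 0; i < vectors.length; i++) {
            System.out.printf("%s: 1-norm %.1f, 2-norm %.1f, inf-norm %.1f%n",
                              names[i], norm1(vectors[i]), norm2(vectors[i]),
                              normInf(vectors[i]));
        }
    }
}
```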

I've found the notion of norms to be useful in many different contexts. The performance differences between quicksort and mergesort can be described as quicksort having a better 1-norm but mergesort having a better ∞-norm. Buying more insurance coverage raises the 1-norm of your costs, but lowers your ∞-norm. A more conservative evaluation tends to focus on the worst-case outcome and thus favors something like the ∞-norm. For example, in the math library the relative size of the error at any location must be less than the stated number of ulps (units in the last place). It is not good enough to have a low average error, but a few locations, or even one location, with a very inaccurate result. During software development, risk assessments evolve with the release life cycle. A change that is welcome early in the release may be rejected as too risky a few weeks before shipping; one way to view this phenomenon is that a larger value of p is being used to compute risk assessments later in the release.

References
Applied Numerical Linear Algebra, James W. Demmel
Matrix Computations, Gene H. Golub and Charles F Van Loan
Numerical Linear Algebra, Lloyd N. Trefethen and David Bau, III

## Tuesday Oct 03, 2006

### What Every Computer Programmer Should Know About Floating-Point Arithmetic, Redux

Next week on Wednesday, October 11, at the Silicon Valley ACCU meeting in San Jose, I'll be giving a version of my talk on What Every Computer Programmer Should Know About Floating-Point Arithmetic, previously seen at Stanford and JavaOne. The meeting is open to the public and free of charge, so if you've ever wondered why adding up ten copies of `0.1d` doesn't equal `1.0` or doubted the need for a floating-point value that is not a number, come on by.

After the talk, I'll post a copy of the slides.

Update: The slides.

### IEEE 754R Ballot

For a number of years, the venerable IEEE 754 standard for binary floating-point arithmetic has been undergoing revision and the committee's results will soon be up for ballot. Back in 2003, I was editor of the draft for a few months and helped incorporate the decimal material.

The balloting process provides the opportunity for interested parties, such as consumers of the standard, to weigh in with comments; instructions for joining the ballot are available. The deadline for signing up has been extended to October 21, 2006.

Major changes from 754 include:

• Support for decimal formats and arithmetic
• More explicit conceptual model of levels of specification
• Hexadecimal strings for binary floating-point values
• Annexes giving recommendations on expression evaluation, alternate exception handling, and transcendental functions

## Friday Jun 23, 2006

### What Every Computer Programmer Should Know About Floating-Point Arithmetic

I'm a part-time master's student in Stanford's ICME program and at the departmental seminar I recently gave a talk, What Every Computer Programmer Should Know About Floating-Point Arithmetic. This is a refinement and update of JavaOne talks I've given with a similar title.

Darcy-Oracle
