Friday Jun 15, 2012

My integer overfloweth

While certain classes like java.lang.Integer and java.lang.Math have been in the platform since the beginning, that doesn't mean there aren't more enhancements to be made in such places! For example, earlier in JDK 8, library support was added for unsigned integer arithmetic. More recently, my colleague Roger Riggs pushed a changeset to support integer overflow, that is, to provide methods which throw an ArithmeticException on overflow instead of returning a wrapped result. Besides being helpful for various programming tasks in Java, methods like the those for integer overflow can be used to implement runtimes supporting other languages, as has been requested at a past JVM language summit.

This year's language summit is coming up in July and I hope to get some additional suggestions there for helpful library additions as part of the general discussions of the JVM and Java libraries as a platform.

Friday Jan 20, 2012

Unsigned Integer Arithmetic API now in JDK 8

At long last, after due discussion and review, I've just pushed initial API support for unsigned integer arithmetic into JDK 8! The support is implemented via static methods, primarily on java.lang.Integer and java.lang.Long, that:

  • Provide bidirectional conversion between strings and unsigned integers
  • Compare values as unsigned
  • Compute unsigned divide and remainder

Colloquially, "unsigned integer" means a 32-bit int or 64-bit long value where all the bits are interpreted as contributing to the magnitude. In the unsigned realm, the values of an integer type of a given bit-width range from 0 to 2width-1 rather than from -(2width-1) to 2width-1-1. A feature of the two's complement encoding of Java integers is that the bitwise results for add, subtract, and multiply are the same if both inputs are interpreted as signed values or both inputs are interpretted as unsigned values. (Other encodings like one's complement and signed magnitude don't have this properly.) Therefore, of the basic arithmetic operations, only a separate divide method needs to be provided to operate on values interpreted as unsigned.

To avoid dealing with the overhead of boxed values and to allow reuse of the built-in arithmetic operators, the unsigned API support does not introduce new types like UnsignedInt with instance methods to perform addition, subtraction, etc. However, that lack of separate Java-level unsigned types does mean a programmer can accidentally improperly mix signed and unsigned values. However, new unsigned types aren't the only way to mitigate this hazard. For example, a naming convention of adding a trailing "U" or "_U" to variables holding unsigned values could be adopted. A more structured approach would be to add an @Unsigned annotation type to the platform and apply that annotation to variables and fields holding unsigned values. One of the extra-linguistic checkers to be enabled by JSR 308 could then analyze code for signed/unsigned correctness.

I'm glad these methods are finally in the JDK. Later in JDK 8, there may be a few more fun bit-twiddling additions, such as methods to get the high order bits of a full multiply and methods which throw exceptions on integer overflow instead of wrapping around.

Friday May 13, 2011

Knuth: "Be an edge, not a node."

At a recent talk, engineering hero and consummate computer scientist Donald Knuth gave the following graph-theoretic advice for future professors: be edges, not nodes. That is, plan to have two sub-specialties in different fields and be able to act as a bridge between disciplines. Perhaps the source of this advice is related to the proof of his Knuth is fondest of: as edges are randomly to a set of nodes, after initially forming a quite disconnected graph, at a certain point there are two quick phase transitions and afterward a large fraction of the nodes then share membership in a single connected component.

I think Knuth's advice is sound for non-academics too (and a step beyond just being T-shaped). Some of the blog entries I've most enjoyed writing are those that bring together tools from different areas of computer science, such as numerical linear algebra and programming languages.

Friday Jul 23, 2010

Understanding Inception using try-with-resources

SPOILERS for the film "Inception" as represented using the try-with-resources statement.[Read More]

Thursday Feb 25, 2010

Notions of Floating-Point Equality

Moving on from identity and equality of objects, different notions of equality are also surprisingly subtle in some numerical realms.

As comes up from time to time and is often surprising, the "==" operator defined by IEEE 754 and used by Java for comparing floating-point values (JLSv3 §15.21.1) is not an equivalence relation. Equivalence relations satisfy three properties, reflexivity (something is equivalent to itself), symmetry (if a is equivalent to b, b is equivalent to a), and transitivity (if a is equivalent to b and b is equivalent to c, then a is equivalent to c).

The IEEE 754 standard defines four possible mutually exclusive ordering relations between floating-point values:

  • equal

  • greater than

  • less than

  • unordered

A NaN (Not a Number) is unordered with respective to every floating-point value, including itself. This was done so that NaNs would not quietly slip by without due notice. Since (NaN == NaN) is false, the IEEE 754 "==" relation is not an equivalence relation since it is not reflexive.

An equivalence relation partitions a set into equivalence classes; each member of an equivalence classes is "the same" as the other members of the classes for the purposes of that equivalence relation. In terms of numerics, one would expect equivalent values to result in equivalent numerical results in all cases. Therefore, the size of the equivalence classes over floating-point values would be expected to be one; a number would only be equivalent to itself. However, in IEEE 754 there are two zeros, -0.0 and +0.0, and they compare as equal under ==. For IEEE 754 addition and subtraction, the sign of a zero argument can at most affect the sign of a zero result. That is, if the sum or difference is not zero, a zero of either sign doesn't change the result. If the sum or differnece is zero and one of the arguments is zero, the other argument must be zero too:

  • -0.0 + -0.0 ⇒ -0.0

  • -0.0 + +0.0 ⇒ +0.0

  • +0.0 + -0.0 ⇒ +0.0

  • +0.0 + +0.0 ⇒ +0.0

Therefore, under addition and subtraction, both signed zeros are equivalent. However, they are not equivalent under division since 1.0/-0.0 ⇒ -∞ but 1.0/+0.0 ⇒ +∞ and -∞ and +∞ are not equivalent.1

Despite the rationales for the IEEE 754 specification to not define == as an equivalence relation, there are legitimate cases where one needs a true equivalence relation over floating-point values, such as when writing test programs, and cases where one needs a total ordering, such as when sorting. In my numerical tests I use a method that returns true for two floating-point values x and y if:
((x == y) &&
(if x and y are both zero they have the same sign)) ||
(x and y are both NaN)
Conveniently, this is just computed by using (Double.compare(x, y) == 0). For sorting or a total order, the semantics of Double.compare are fine; NaN is treated as being the largest floating-point values, greater than positive infinity, and -0.0 < +0.0. That ordering is the total order used by by java.util.Arrays.sort(double[]). In terms of semantics, it doesn't really matter where the NaNs are ordered with respect to ther values to as long as they are consistently ordered that way.2

These subtleties of floating-point comparison were also germane on the Project Coin mailing list last year; the definition of floating-point equality was discussed in relation to adding support for relational operations based on a type implementing the Comparable interface. That thread also broached the complexities involved in comparing BigDecimal values.

The BigDecimal class has a natural ordering that is inconsistent with equals; that is for at least some inputs bd1 and bd2,
c.compare(bd1, bd2)==0
has a different boolean value than
bd1.equals(bd2).3
In BigDecimal, the same numerical value can have multiple representations, such as (100 × 100) versus (10 × 101) versus (1 × 102). These are all "the same" numerically (compareTo == 0) but are not equals with each other. Such values are not equivalent under the operations supported by BigDecimal; for example (100 × 100) has a scale of 0 while (1 × 102) has a scale of -2.4

While subtle, the different notions of numerical equality each serve a useful purpose and knowing which notion is appropriate for a given task is an important factor in writing correct programs.


1 There are two zeros in IEEE 754 because there are two infinities. Another way to extend the real numbers to include infinity is to have a single (unsigned) projective infinity. In such a system, there is only one conceptual zero. Early x87 chips before IEEE 754 was standardized had support for both signed (affine) and projective infinities. Each style of infinity is more convenient for some kinds of computations.

2 Besides the equivalence relation offered by Double.compare(x, y), another equivalence relation can be induced by either of the bitwise conversion routines, Double.doubleToLongBits or Double.doubleToRawLongBits. The former collapses all bit patterns that encode a NaN value into a single canonical NaN bit pattern, while the latter can let through a platform-specific NaN value. Implementation freedoms allowed by the original IEEE 754 standard have allowed different processor families to define different conventions for NaN bit patterns.

3 I've at times considered whether it would be worthwhile to include an "@NaturalOrderingInconsistentWithEquals" annotation in the platform to flag the classes that have this quirk. Such an annotation could be used by various checkers to find potentially problematic uses of such classes in sets and maps.

4 Building on wording developed for the BigDecimal specification under JSR 13, when I was editor of the IEEE 754 revision, I introduced several pieces of decimal-related terminology into the draft. Those terms include preferred exponent, analogous to the preferred scale from BigDecimal, and cohort, "The set of all floating-point representations that represent a given floating-point number in a given floating-point format." Put in terms of BigDecimal, the members of a cohort would be all the BigDecimal numbers with the same numerical value, but distinct pairs of scale (negation of the exponent) and unscaled value.

Friday Feb 12, 2010

Finding a bug in FDLIBM pow

Writing up a piece of old work for some more Friday fun, an example of testing where the failures are likely to be led to my independent discovery of a bug in the FDLIBM pow function, one of only two bugs fixed in FDLIBM 5.3. Even back when this bug was fixed for Java some time ago (5033578), the FDLIBM library was well-established, widely used in the Java platform and elsewhere, and already thoroughly tested so I was quite proud my tests found a new problem. The next most recent change to the pow implementation was eleven years prior to the fix in 5.3.

The specification for Math.pow is involved, with over two dozen special cases listed. When setting out to write tests for this method, I re-expressed the specification in a tabular form to understand what was going on. After a few iterations reminiscent of tweaking a Karnaugh map, the table below was the result.

Special Cases for FDLIBM pow and {Math, StrictMath}.pow
xy y
x –∞ –∞ < y < 1 –1 –1 < y < 0 –0.0 +0.0 0 < y < 1 1 1 < y < +∞ +∞ NaN
–∞ +0.0 f2(y) 1.0 f1(y) +∞ NaN
–∞ < y < –1 +0.0 f3(x, y) f3(x, y) +∞
–1 NaN NaN
–1 < y < 0 +∞ +0.0
–0.0 +∞ f1(y) f2(y) +0.0
+0.0 +∞ +0.0
0 < y < 1 +∞     x   +0.0
1 NaN 1.0 NaN
1 < y < +∞ +0.0 x +∞
+∞ +0.0 +∞
NaN NaN NaN

f1(y) = isOddInt(y) ? –∞ : +∞;
f2(y) = isOddInt(y) ? –0.0 : +0.0;
f3(x, y) = isEvenInt(y) ? |x|y : (isOddInt(y) ? –|x|y : NaN);
Defined to be +1.0 in C99, see §F.9.4.4 of the C99 specification. Large magnitude finite floating-point numbers are all even integers (since the precision of a typical floating-point format is much less than its exponent range, a large number will be an integer times the base raised to a power). Therefore, by the reasoning of the C99 committee, pow(-1.0, ∞) was like pow(-1.0, Unknown large even integer) so the result was defined to be 1.0 instead of NaN.

The range of arguments in each row and column are partitioned into eleven categories, ten categories of finite values together with NaN (Not a Number). Some combination of x and y arguments are covered by multiple clauses of the specification. A few helper functions are defined to simplify the presentation. As noted in the table, a cross-platform wrinkle is that the C99 specification, which came out after Java was first released, defined certain special cases differently than in FDLIBM and Java's Math.pow.

A regression test based on this tabular representation of pow special cases is jdk/test/java/lang/Math/PowTests.java. The test makes sure each interesting combination in the table is probed at least once. Some combinations receive multiple probes. When an entry represents a range, the exact endpoints of the range are tested; in addition, other interesting interior points are tested too. For example, for the range 1 < x< +∞ the individual points tested are:

+1.0000000000000002, // nextAfter(+1.0, +oo)
+1.0000000000000004,
+2.0,
+Math.E,
+3.0,
+Math.PI,
-(double)Integer.MIN_VALUE - 1.0,
-(double)Integer.MIN_VALUE,
-(double)Integer.MIN_VALUE + 1.0,
 double)Integer.MAX_VALUE + 4.0,
 (double) ((1L<<53)-1L),
 (double) ((1L<<53)),
 (double) ((1L<<53)+2L),
-(double)Long.MIN_VALUE,
  Double.MAX_VALUE,

Besides the endpoints, the interesting interior points include points worth checking because of transitions either in the IEEE 754 double format or a 2's complement integer format.

Inputs that used to fail under this testing include a range of severities, from the almost always numerical benign error of returning a wrongly signed zero, to returning a zero when the result should be finite nonzero result, to returning infinity for a finite result, to even returning a wrongly singed infinity!

Selected Failing Inputs

Failure for StrictMath.pow(double, double):
       For inputs -0.5                   (-0x1.0p-1) and 
                   9.007199254740991E15  (0x1.fffffffffffffp52)
       expected   -0.0                   (-0x0.0p0)
       got         0.0                   (0x0.0p0).

Failure for StrictMath.pow(double, double):
       For inputs -0.9999999999999999    (-0x1.fffffffffffffp-1) and 
                   9.007199254740991E15  (0x1.fffffffffffffp52)
       expected   -0.36787944117144233   (-0x1.78b56362cef38p-2)
       got        -0.0                   (-0x0.0p0).

Failure for StrictMath.pow(double, double):
       For inputs -1.0000000000000004    (-0x1.0000000000002p0) and 
                   9.007199254740994E15  (0x1.0000000000001p53)
       expected  54.598150033144236      (0x1.b4c902e273a58p5)
       got       0.0                     (0x0.0p0).

Failure for StrictMath.pow(double, double):
       For inputs -0.9999999999999998    (-0x1.ffffffffffffep-1) and 
                   9.007199254740992E15  (0x1.0p53)
       expected    0.13533528323661267   (0x1.152aaa3bf81cbp-3)
       got         0.0                   (0x0.0p0).


Failure for StrictMath.pow(double, double):
       For inputs -0.9999999999999998    (-0x1.ffffffffffffep-1) and 
                  -9.007199254740991E15  (-0x1.fffffffffffffp52)
       expected   -7.38905609893065      (-0x1.d8e64b8d4ddaep2)
       got        -Infinity              (-Infinity).


Failure for StrictMath.pow(double, double):
       For inputs -3.0                   (-0x1.8p1) and 
                   9.007199254740991E15  (0x1.fffffffffffffp52)
       expected   -Infinity              (-Infinity)
       got        Infinity               (Infinity).

The code changes to address the bug were fairly simple; corrections were made to extracting components of the floating-point inputs and sign information was propagated properly.

Even expertly written software can have errors and even long-used software can have unexpected problems. Estimating how often this bug in FDLIBM caused an issue is difficult, while the errors could be egregious, the needed inputs to elicit the problem were arguably unusual (even though perfectly valid mathematically). Thorough testing is key aspect of assuring the quality of numerical software, it is also helpful for end-users to be able to examine the output of their programs to help notice problems.

Friday Feb 05, 2010

Recognizing all valid integral strings with regular expressions

For a fun Friday hack, this blog entry describes a program to generate a regular expression to recognize all the valid strings of integers in a given base, that is, a regular expression (regex) which accepts all in-range strings but rejects strings that would cause an overflow if converted.

This sort of regular expression could be used to verify a string won't cause a NumberFormatException when processed by one of the many text → number methods in the Java platform, such as Integer.parseInt(String s, int base). However, using a regular expression may be more expensive than attempting the conversion and catching the exception on invalid input; the regular expression discussed is possibly of more academic interest than practical significance!

An analogous regular expression for floating-point strings has been published in the javadoc for several releases.

First, is the set of strings in a given base that will convert to a 32-bit or 64-bit integer a regular language, a language that can be recognized by a regular expression? Yes, because ignoring sequences of leading zeros for a moment, there are only a finite number of strings that can be converted to integer values, one string for each of the 232 or 264 values in the int or long types, respectively. All finite languages are regular, strings of zero of more leading zeros are regular, and concatenations of regular languages are regular. Putting those facts together, the entire set of convertible strings forms a regular language. This reasoning is not dependent on the base so it holds for all integer bases, including bases 2 through 36 supported by Java's libraries.

A core regular expression of billions of strings offered as alternatives ("1|2|3|...|2147483647") while fine from a mathematical perspective, would be too slow and awkward to use. Fortunately, the pattern of valid strings has sufficient structure to yield reasonably short and manageable regular expressions.

To start we will only consider decimal strings; additionally, we will assume only ASCII digits need to be recognized. Integer values range from

-2147483648

to

 2147483647

These strings both have 10 digits. Putting aside leading zeros for the moment, all strings of 9 or fewer digits are valid, with or without a leading minus sign. A regex to recognize strings with 9 or fewer digits is trivial; less obvious is how to specify the "ragged" 10-digit nonzero strings which are in range.

The core regular expression covering non-ragged inputs is

"(-)?"+			// optional leading minus sign
"0\*"+			// leading zeros
"(\\p{Digit}{1,9})"	// numbers with 1 to 9 digits

For the 10-digit strings, if the leading digit is less than the maximum digit, all other digits can have any value:

1(\\p{Digit}{9})

If the first digit is at it's maximum, then if the second digit is less than its maximum, the third and subsequent digits can have any value:

20\\p{Digit}{8}

Likewise, if the first and second digits are at their maximum, then if the third digit is less than its maximum, the fourth and subsequent digits can have any value. Continuing this pattern for all the digit positions:

1(\\p{Digit}{9})|
20(\\p{Digit}{8})|
21[0-3](\\p{Digit}{7})|
214[0-6](\\p{Digit}{6})|
2147[0-3](\\p{Digit}{5})|
21474[0-7](\\p{Digit}{4})|
214748[0-2](\\p{Digit}{3})|
2147483[0-5](\\p{Digit}{2})|
21474836[0-3](\\p{Digit}{1})|
214748364[0-7]

This regular expression can be ORed with the regex for strings of 1 to 9 digits listed above. Finally, a separate case for the most negative value must be added:

-0\*2147483648

All together, with some newlines for readability this yields:

((-)?0\*((\\p{Digit}){1,9})
|([0-1]\\p{Digit}{9})
|(20\\p{Digit}{8})
|(21[0-3]\\p{Digit}{7})
|(214[0-6]\\p{Digit}{6})
|(2147[0-3]\\p{Digit}{5})
|(21474[0-7]\\p{Digit}{4})
|(214748[0-2]\\p{Digit}{3})
|(2147483[0-5]\\p{Digit}{2})
|(21474836[0-3]\\p{Digit}{1})
|(214748364[0-7])
)|
(-0\*2147483648)

The ragged pattern for nonzero strings with maximum possible length will exist for bases that aren't powers of 2, for the Java libraries that includes all conversion radixes other than 2, 4, 8, 16, and 32. For powers of two, the pattern is more regular. For example, in base 16 the values range from

-80000000

to

7fffffff

so for base 16 the regular expression can simply be

((-)?0\*(\\p{XDigit}{1,7}|
	[0-7](\\p{XDigit}{7}))|
(-0\*80000000)

Generalizing the regular expression generation algorithm to any given base:

  1. Create a regex for an optional leading minus sign ("-") and zero or more leading zeros, "0\*."

  2. For the base in question, generate the strings for MIN_VALUE and MAX_VALUE.

  3. Find n, the length of the MAX_VALUE string. Signed strings of that base with (n-1) or fewer characters are all valid inputs; create a regex recognizing (n-1) or fewer digits.

  4. Create a regex to recognize valid strings of length n.

  5. Combine the above regular expressions.

  6. To the above, concatenate a separate regex for the most negative value,

    -(0\*)MIN_VALUE_STRING

Taken together, this regex will recognize exactly those strings convertible to the base in question.

Untested code implementing this algorithm is available for your viewing pleasure. Any further debugging, tuning, and enhancements are left as "an exercise for the reader," happy hacking!

About

darcy

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today
News

No bookmarks in folder

Blogroll