### after the deluge: how to tidy an overflow

Joe Darcy has posted a fine note on the problem of managing integer overflow in Java. This was raised again last week at the JVM Language Summit. Joe’s blog has already attracted some good comments. Here is my comment, which has inflated itself into a blog entry of its own.

The problem is to take two machine words, do a common arithmetic operation on them, and obtain the arithmetically correct results. In particular, it means that at least one bit must be collected besides the machine word of non-overflowing result data. For example (as Joe notes), the extra bit required to report overflow could be implied by an exceptional return. In any case, there are a couple of degrees of freedom in this problem: What we will use the extra bit(s) for, and how to deliever it (or them). After examining a couple of half measures, I will enumerate about ten ways to return normally with a pair of machine words.

### Use Cases

If overflow is expected to be rare (at most 10\*\*-4, say) then an argument-bearing exception is reasonable and harmless. That is, the intrinsic that performs add (or multiply) can be specified to throw an exception that contains “the rest of the story”, if that story does not need to be told often. Joe suggests putting helpful information into the exception to help slow-path code. I think the slow path is likely to be able to “start from scratch”, so exception-carried information is likely to have only a diagnostic value.

If overflow is somewhat common (10\*\*-2 or more, say), a pre-constructed exception is an option to consider. The catch for such an exception amounts to an alternate return point, which would contain slow-path code to the desired results in a robust way. The pre-constructed exception would have no information content.

If the overflowing operation is being used to build multi-precision math, overflow will be more common, even pushing 50% on random-looking inputs. (Considerations like Benford’s law suggest that random-looking inputs are paradoxically rare. Bignum addition is not likely, in practice, to create a bignum longer than its inputs, so the overflow condition will be rare on one out of the N adds in the bignum operation.) If overflow is common (more than 10%), it is important to process the requested operation as fast (or nearly as fast) as the non-overflowing case.

The last use case is the demanding one. It seems to require that the machine code generated in the relevant inner loop be nearly as good as an assembly programmer would write, using any available CPU-specific overflow indicator. This takes us, oddly enough, to the question of API design.

### API Design Cases

Various source code patterns are compatible with various optimizations. APIs which involve exceptions can be partially optimized, but not (reliably) down to the single instructions required for really tight bignum loops. So let’s talk about APIs for delivering the extra result bits on the fast path. For ease of reference in subsequent commenting on Joe’s blog, I will make a link for each type of API.

#### Throw to Slow Path

For the record, here is a fragment from Joe’s example which shows the previously mentioned technique of reporting overflow with an exception:
```static int addExact(int a, int b) throws ArithmeticException { ... }
...
int z;
try { z = addExact(a, b); }
catch (ArithmeticException ignore) { ... }
```

#### Null to Slow Path

Besides the half-measure of a slow path reached by an exception, there is another ingenious trick that Joe considers and rejects: Report the exceptional condition by returning null instead of an boxed result. The effect of this is to extend the dynamic range of the operand word (32 or 64 bits) by one more code point, a sort of `NaN` value. Since JVMs are very good at optimizing null checks, and (more recently) good at optimizing box/unbox optimizations, this trick might be easier to optimize than an equivalent throw-based trick. But it still feels like a trick, not a real API. Also, it cannot deliver additional result bits to the fast path.

#### Longs as Int Pairs

Another option would be to work only with 32-bit ints and forget about longs. Then the API could stuff two ints into a long as needed. The code for `addExact` become trivial: `return (long)a+b`. Then, for the cases where overflow is rare and simply needs detection, it is a relatively simple pattern-matching optimization to extract the overflow bit from an expression like `z!=(int)z`. This little comparison could be intrinsified, perhaps, for greater likelihood of recognition, but it is already probably simple enough.

This techinque requires stereotyped shift-and-mask expressions to pack and unpack the value pairs. (The expressions contain characteristic 32-bit shift constants.) Such expressions are probably optimized in all JVM compilers. Therefore, it is probably best, at present, to code bignums in radix 2\*\*32. (Or, if signed multiplication is a problem, perhaps radix 2\*\*31 is better.)

Although the code of such things is really simple, it is not so simple as to prevent inventive programmers from creating alternative formulations which the optimizer doesn’t expect and can’t optimize. So to make this hack optimize reliably, it is worth defining standard constructors and selectors between int-pairs and longs, and encourage programmers to use that API.

```static long intPair(int a, int b) { return ((long)a << 32) + (b & (-1L >>> 32)); }
static int lowInt(long ab) { return (int)(ab >> 0); }
static int highInt(long ab) { return (int)(ab >> 32); }
```

I have been unspecific about how many bits are really flying around. The simplest case is 32 bits of operand, and one extra bit of overflow. The pattern just presented has 32 bits of operand and 64 bits of result. A more general case is 64 bits of operand and 128 bits of result. Any exact (overflow-handling) arithmetic routine needs to return a logical tuple of `(long,long)`. This case resists half-measures that report state but not information, or that work for ints but not longs. For the rest of this note, let us consider the problem of returning a 128 bit result.

#### Multiple Return Values

You may have guessed already that my favorite (future-only) realization of this would be multiple return values. The intrinsics would look something like this:
```static (long,long) multiplySignedExact(long a, long b) { ... }
static (long,long) multiplyUnsignedExact(long a, long b) { ... }
static boolean signedOverflow(long a, long b) { return (b != (a>>63)); }
static boolean unsignedOverflow(long a, long b) { return (b != 0); }
...
(long z0, long z1) = multiplySignedExact(a, b);
if (signedOverflow(z0, z1)) { ... }
```
The idiom for detecting overflow is simple enough that it could be detected by simple pattern matching in the optimizer. Making it into an intrinsic (as noted before) helps programmers to use patterns which the compiler is expecting. The overflow detection intrinsic itself does not need any special magic. The 128-bit multiple intrinsic will almost certainly be special-cased by an optimizing dynamic compiler.

If your eyes are still stinging from reading the pseudo-code above, I don’t need to remind you that we don’t have tuples or multiple value returns right now. But there are alternatives for wrapping up the return data. The sad story is that none of the alternatives is a compelling win.

#### Arrays as Values Pairs

A simple expedient is to use a two-element array to hold the two returned values. Escape analysis (EA) is present in modern optimizing compilers. (See this example with JRockit.) One could hope that the arrays would “just evaporate”. But EA patterns are fragile; a slight source code variation can unintentionally destroy them. Most ameliorations by the programmer, such as attempting to cache the array for reuse, will break the real optimization.

#### Wrappers as Value Pairs

The same point goes for a specially constructed wrapper type modeled on the standard wrappers (like `Long`) but with multiple fields (think `LongPair`). This wrapper type would be immutable and therefore less liable to abuse, but its identity as a Java object would still be an obstacle to full optimization.

#### Value-like Pairs

If we add value types to the JVM, with relaxed identity semantics, a value-like `LongPair` would play better with EA, since pair values could be destroyed and reconstituted at will by the optimizer. Alas, that also is in the future.

#### Return by Reference

Some Java API designers have tried the C-like expedient of passing a pointer to a result variable, as an extra parameter. As far as I know, the experience with this tactic is not satisfactory. (Such a buffer in Java is usually a one-element array.) Buffer re-use is a continual temptation to the programmer, and a downfall to EA. Returning a value by poking it into a mailbox is not something Java optimizers are looking for. If the whole idiom needs to boil down to an instruction or two, allocating an array is a hazard to optimization.
```static long multiplySignedExact(long a, long b, long[] z1a) { ...; z1a[0] = z1; return z0; }
...
long[] z1a = {0}
long z0 = multiplySignedExact(a, b, z1a), z1 = z1a[0];
if (signedOverflow(z0, z1)) { ... }
```

#### Return by Thread Local

Another possible mailbox for delivering a second result would be a thread-local variable. The abstract model is like that of continuation-passing machines, in which multiple arguments and multiple return values are passed through a set of virtual registers. Java (pre-tuples) only provides access to one such register for return values, but additional registers could be modeled as thread-local values.

The optimization hazard here is that the thread local values are too widely visible, and it is hard to analyze their use in live ranges, as one can do with registers. (Specifically, they can be reached via pointers!) So the optimizer has to be careful working with side effects to such things. Still, I think this is a promising area of research.

#### Static Single Assignment

The idea of thread-locals is interesting, since they are almost structured enough to optimize as if they were registers. (This is counter-intuitive if you know that they are implemented, at least in the interpreter, via hash table lookups.) Perhaps there is a subclass of `ThreadLocal` that needs to be defined, with a restricted use pattern that can be optimized into simple register moves. (Such a standard restricted use pattern is static single-assignment, used in HotSpot.) If, after intrinsics are expanded, the second return value looks like a register (and nothing more) then the optimizer has full freedom to boil everything down to a few instructions.

#### Return to an Engine Field

So far we have supposed that the arithmetic methods are all static, and this is (IMO) their natural form. But it could be argued that arithmetic should be done by an explicit `ArithmeticEngine` object which somehow encapsulates all that goodness of those natural numbers. If you can swallow that, there is an unexpected side benefit: The engine is an acceptable (I cannot bring myself to say natural) place to store the second return value from intrinsics that must return one.
```/\*non-static\*/ long multiplySignedExact(long a, long b) { ... this.highWord=z1; return z0; }
...
ArithmeticEngine ae = new TitanArithmeticEngine();
long z0 = ae.multiplySignedExact(a, b), z1 = ae.getHighWord();
if (signedOverflow(z0, z1)) { ... }
```
With some luck in EA, the engine field (`highWord` above) might get scalarized to a register, and then participate in SSA-type optimizations.

#### Continue in a Lambda

With closures (coming to a JDK7 near you...) you could also drop to continuation-passing style (CPS) to receive the two result words from the intrinsic:
```static long multiplySignedExact(long a, long b, LongReceiver z1r) { ... z1r.receive(z1); return z0; }
...
long z1 = 0;
long z0 = multiplySignedExact(a, b, #(long z)->{z1=z});
if (signedOverflow(z0, z1)) { ... }
```
Optimizing this down to an instruction or two requires mature closure optimization, something we’ll surely get as closures mature. On the other hand, we will surely lack such optimizations at first.

Those are a lot of choices for API design! I encourage you to add your thoughts to Joe’s blog.

Comments:

(I'm writing this here, since it's not directly relevant to the JVM.)

Pure, a dynamically typed eager functional language based on generalized term rewriting, is the successor to Q, also a dynamically typed etc. In Q, which was interpreted directly, numbers could only be integers (implemented as GMP mpz_t objects) or doubles.

Pure, however, is JIT-compiled using LLVM, and it has three user-visible numeric datatypes: ints, bigints (still mpz_t objects), and doubles. Ints are known directly to LLVM, and have signed 32-bit wrapping semantics; bigints aren't and don't. The two types are essentially independent, except that ints are widened to bigints in int/bigint mixed-mode arithmetic, just as they are widened to doubles in int/double mixed mode. The three numeric types have separate literal forms (suffixed L for a bigint), though a too-big literal is quietly interpreted as a bigint literal.

All this puts the onus on the user of controlling speed vs. accuracy, but in a way which is fairly easy to understand: there are only two types of integers, and dynamic typing means that you don't have to distinguish them most of the time. When you do want to, you can attach a static type declaration to an argument, providing overloading.

"What about 64-bit machine ints?", I asked at the time. Answer: they make sense on (rare) ILP64 machines, or where you have fixnums, but when using LLVM on LP64 and LLP64 machines the boxed 32-bit int still rules. The type "long" is known only to the C FFI, and represents 32 or 64 bits according to the machine ABI; the result therefore becomes an int or a bigint under the covers. The same is done with the other C types that aren't native to Pure.

Posted by John Cowan on August 04, 2010 at 08:59 AM PDT #

The computational engine would seem the best solution to me:

1. You can easily inject different computational engines with different performance characteristics (I am assuming ComputationalEngine is an interface).

2. The ComputationEngine instance could be a ThreadLocal thus making the code thread safe.

Posted by Howard Lovatt on August 04, 2010 at 09:13 AM PDT #

Post a Comment:
• HTML Syntax: NOT allowed
##### About

John R. Rose

Java maven, HotSpot developer, Mac user, Scheme refugee.

Once Sun and present Oracle engineer.

##### Archives
Sun Mon Tue Wed Thu Fri Sat « December 2016 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 Today
RSS Atom