The problem is to take two machine words, do a common arithmetic operation on them, and obtain the arithmetically correct results. In particular, it means that at least one bit must be collected besides the machine word of non-overflowing result data. For example (as Joe notes), the extra bit required to report overflow could be implied by an exceptional return. In any case, there are a couple of degrees of freedom in this problem: What we will use the extra bit(s) for, and how to deliever it (or them). After examining a couple of half measures, I will enumerate about ten ways to return normally with a pair of machine words.
If overflow is somewhat common (10\*\*-2 or more, say), a pre-constructed exception is an option to consider. The catch for such an exception amounts to an alternate return point, which would contain slow-path code to the desired results in a robust way. The pre-constructed exception would have no information content.
If the overflowing operation is being used to build multi-precision math, overflow will be more common, even pushing 50% on random-looking inputs. (Considerations like Benford’s law suggest that random-looking inputs are paradoxically rare. Bignum addition is not likely, in practice, to create a bignum longer than its inputs, so the overflow condition will be rare on one out of the N adds in the bignum operation.) If overflow is common (more than 10%), it is important to process the requested operation as fast (or nearly as fast) as the non-overflowing case.
The last use case is the demanding one. It seems to require that the machine code generated in the relevant inner loop be nearly as good as an assembly programmer would write, using any available CPU-specific overflow indicator. This takes us, oddly enough, to the question of API design.
static int addExact(int a, int b) throws ArithmeticException { ... }
...
int z;
try { z = addExact(a, b); }
catch (ArithmeticException ignore) { ... }
NaN
value. Since JVMs are very good at optimizing null checks, and (more recently) good at optimizing box/unbox optimizations, this trick might be easier to optimize than an equivalent throw-based trick. But it still feels like a trick, not a real API. Also, it cannot deliver additional result bits to the fast path.addExact
become trivial: return (long)a+b
. Then, for the cases where overflow is rare and simply needs detection, it is a relatively simple pattern-matching optimization to extract the overflow bit from an expression like z!=(int)z
. This little comparison could be intrinsified, perhaps, for greater likelihood of recognition, but it is already probably simple enough.This techinque requires stereotyped shift-and-mask expressions to pack and unpack the value pairs. (The expressions contain characteristic 32-bit shift constants.) Such expressions are probably optimized in all JVM compilers. Therefore, it is probably best, at present, to code bignums in radix 2\*\*32. (Or, if signed multiplication is a problem, perhaps radix 2\*\*31 is better.)
Although the code of such things is really simple, it is not so simple as to prevent inventive programmers from creating alternative formulations which the optimizer doesn’t expect and can’t optimize. So to make this hack optimize reliably, it is worth defining standard constructors and selectors between int-pairs and longs, and encourage programmers to use that API.
static long intPair(int a, int b) { return ((long)a << 32) + (b & (-1L >>> 32)); }
static int lowInt(long ab) { return (int)(ab >> 0); }
static int highInt(long ab) { return (int)(ab >> 32); }
I have been unspecific about how many bits are really flying around. The simplest case is 32 bits of operand, and one extra bit of overflow. The pattern just presented has 32 bits of operand and 64 bits of result. A more general case is 64 bits of operand and 128 bits of result. Any exact (overflow-handling) arithmetic routine needs to return a logical tuple of (long,long)
. This case resists half-measures that report state but not information, or that work for ints but not longs. For the rest of this note, let us consider the problem of returning a 128 bit result.
static (long,long) multiplySignedExact(long a, long b) { ... }
static (long,long) multiplyUnsignedExact(long a, long b) { ... }
static boolean signedOverflow(long a, long b) { return (b != (a>>63)); }
static boolean unsignedOverflow(long a, long b) { return (b != 0); }
...
(long z0, long z1) = multiplySignedExact(a, b);
if (signedOverflow(z0, z1)) { ... }
If your eyes are still stinging from reading the pseudo-code above, I don’t need to remind you that we don’t have tuples or multiple value returns right now. But there are alternatives for wrapping up the return data. The sad story is that none of the alternatives is a compelling win.
Long
) but with multiple fields (think LongPair
). This wrapper type would be immutable and therefore less liable to abuse, but its identity as a Java object would still be an obstacle to full optimization.LongPair
would play better with EA, since pair values could be destroyed and reconstituted at will by the optimizer. Alas, that also is in the future.static long multiplySignedExact(long a, long b, long[] z1a) { ...; z1a[0] = z1; return z0; }
...
long[] z1a = {0}
long z0 = multiplySignedExact(a, b, z1a), z1 = z1a[0];
if (signedOverflow(z0, z1)) { ... }
The optimization hazard here is that the thread local values are too widely visible, and it is hard to analyze their use in live ranges, as one can do with registers. (Specifically, they can be reached via pointers!) So the optimizer has to be careful working with side effects to such things. Still, I think this is a promising area of research.
ThreadLocal
that needs to be defined, with a restricted use pattern that can be optimized into simple register moves. (Such a standard restricted use pattern is static single-assignment, used in HotSpot.) If, after intrinsics are expanded, the second return value looks like a register (and nothing more) then the optimizer has full freedom to boil everything down to a few instructions.ArithmeticEngine
object which somehow encapsulates all that goodness of those natural numbers. If you can swallow that, there is an unexpected side benefit: The engine is an acceptable (I cannot bring myself to say natural) place to store the second return value from intrinsics that must return one./\*non-static\*/ long multiplySignedExact(long a, long b) { ... this.highWord=z1; return z0; }
...
ArithmeticEngine ae = new TitanArithmeticEngine();
long z0 = ae.multiplySignedExact(a, b), z1 = ae.getHighWord();
if (signedOverflow(z0, z1)) { ... }
highWord
above) might get scalarized to a register, and then participate in SSA-type optimizations.static long multiplySignedExact(long a, long b, LongReceiver z1r) { ... z1r.receive(z1); return z0; }
...
long z1 = 0;
long z0 = multiplySignedExact(a, b, #(long z)->{z1=z});
if (signedOverflow(z0, z1)) { ... }
Those are a lot of choices for API design! I encourage you to add your thoughts to Joe’s blog.
(I'm writing this here, since it's not directly relevant to the JVM.)
Pure, a dynamically typed eager functional language based on generalized term rewriting, is the successor to Q, also a dynamically typed etc. In Q, which was interpreted directly, numbers could only be integers (implemented as GMP mpz_t objects) or doubles.
Pure, however, is JIT-compiled using LLVM, and it has three user-visible numeric datatypes: ints, bigints (still mpz_t objects), and doubles. Ints are known directly to LLVM, and have signed 32-bit wrapping semantics; bigints aren't and don't. The two types are essentially independent, except that ints are widened to bigints in int/bigint mixed-mode arithmetic, just as they are widened to doubles in int/double mixed mode. The three numeric types have separate literal forms (suffixed L for a bigint), though a too-big literal is quietly interpreted as a bigint literal.
All this puts the onus on the user of controlling speed vs. accuracy, but in a way which is fairly easy to understand: there are only two types of integers, and dynamic typing means that you don't have to distinguish them most of the time. When you do want to, you can attach a static type declaration to an argument, providing overloading.
"What about 64-bit machine ints?", I asked at the time. Answer: they make sense on (rare) ILP64 machines, or where you have fixnums, but when using LLVM on LP64 and LLP64 machines the boxed 32-bit int still rules. The type "long" is known only to the C FFI, and represents 32 or 64 bits according to the machine ABI; the result therefore becomes an int or a bigint under the covers. The same is done with the other C types that aren't native to Pure.
The computational engine would seem the best solution to me:
1. You can easily inject different computational engines with different performance characteristics (I am assuming ComputationalEngine is an interface).
2. The ComputationEngine instance could be a ThreadLocal thus making the code thread safe.