an experiment with generic arithmetic
By john.rose on Jul 21, 2010
(By the way, structuring this problem is much more than an exercise in classic object-oriented programming, since arithmetic operations have symmetries which cannot be modeled by sending a request to an encapsulated method in the left-hand operand. Some sort of double dispatch is required, which is an interesting problem, which I won't go into. Acid test: Design your generic arithmetic framework so that it simulates Java arithmetic, but allow users can add new number classes, starting with
BigDecimal. Now test it by adding rational numbers and complex numbers, as two separate and independent modules. If that was easy, add units or formal polynomials. Don't forget that your users are expecting to perform arithmetic on integer iteration variables of their inner loops, and they refuse to declare a static type for it.)
The first trick of optimizing generic arithmetic is to assume that each particular operation (let's say it an addition) is going to apply to a limited range of operand formats, usually a single format, usually a machine word (boxed and tagged neatly, in an object). Regardless how many exotic number types are running around globally, each local neighborhood is pretty boring. The theory of invokedynamic is that each operation (a call site in the JVM bytecodes) is separately linked by a bootstrap method. Even if two sites look just the same, each one gets its own bootstrap process. This allows (but does not require) the invokedynamic linkage logic (which is part of the language runtime, not the JVM) to assign separate profile points to each individual operation. Invokedynamic allows the JVM to separately customize and optimize each arithmetic operation, according its operand types and context.
To put this theory into practice, I tried a first experiment combining invokedynamic with Kawa, a mature implementation of Scheme on the JVM. Kawa adds dynamically typed numbers together with routine called
AddOp.$Pl. As suggested above, it takes two object references, classifies them as numbers, converts them to a common type (
Double, whatever) and adds them together according to that common type.
You might expect that the classification logic is expensive, and it is, compared to the actual cost of adding two machine words. The main cost, however, appears to be the allocation of intermediate values on the heap. In arbitrary approximate units, the cost of classifying operand objects is 1 to 3, the cost of adding their "payloads" is less than 0.1, and the cost of boxing the result is 5.
The best way to speed up such computations, therefore, is to elide the allocations, either some interprocedural analysis or handshake that keeps the payloads in registers or on the stack, or to encode small payloads in pseudo-pointers. Failing that, we are always working to make GC faster. I suppose JVM engineers need to do all three...
Meanwhile, there is the interesting problem of classification. The range of costs noted above is affected by the success or failure of type profiling. JVMs and other dynamically typed systems usually keep track of operand types (in many clever ways) so that they can simplify and specialize their execution. Specifically, although the Kawa routine
AddOp.$Pl is ready to handle numbers of all types, if it is presented with only (say)
Integer operands, the HotSpot JVM will notice that the internal classification tests never observe anything but that one type, and it will adjust the machine code for that routine to work well for that type, and "bail out" to an expensive recompilation if the assumption changes. If this tactic works, the total cost of adding two
Integers together is 6 units. Otherwise, it is 8 units.
Using invokedynamic, we can add to the JVM's profiling information. The trick is to have the bootstrap method link in invokedynamic site with a method handle that observes operand types, briefly records them, and then specializes itself to a new method handle that works better for the previously observed operand types. My experiment (above) is very simple: The call site observes the type of one operand of one call, and spins an "if/then/else" which says, "as long as future operands are also of that type, cast to that type, and call
The effect of the extra cast is to send specialized type information into the
AddOp.$Pl routine, which enables it to fold up, about the same as if the JVM's global type profile had succeeded. In essence, invokedyanmic works as a front-end to the
AddOp.$Pl routine, helping it to classify its operands. The overhead of the invokedynamic instruction is about 1 unit (which is higher than it should be, but that's young technology for you). But, the extra profiling prevents the worst case classification cost of 3 units, keeping it down to 1.
(The comment at the top of the benchmark has more information about how these numbers were obtained. Full disclosure: This experiment was run on the latest mlvm experimental build. Also, the JVM's inlining heuristics, designed circa 1999, had to be twisted around manually in order to keep the JIT interested in processing the program in large enough units. To get valid system-level results, we'll have to engineer better inlining tendencies into the JVM's logic.)
This result is encouraging for a few reasons. First, under the realistic assumption of a polluted type profile, invokedynamic can reduce the cost of generic arithmetic by double-digit percentages. That margin will improve as JVMs improve the quality of invokedynamic code generation. See the upcoming paper by Thalinger & Rose in PPPJ 2010 for more observations on this point.
Second, this particular experiment required no change at all to the Kawa runtime routine. The new JVM features can be used in an evolutionary way. But deeper changes can get better results. For example, dynamic languages sometimes have "metaobject protocols" which allow the dynamic type system to communicate (in a modular way) with the compiler. A good MOP can encode optimization decisions about generic arithmetic sites, and invokedynamic provides a clean way to plug those decisions into the JVM's generatedcode.
Third (and most important) is the fact that the temporary allocation, although it is by far the highest cost, is also vulnerable to local analysis. After a type-customized version of
AddOp.$Pl is inlined at an invokedynamic site, it becomes an exercise in escape analysis (or fixnums; take your pick) to remove the garbage collector from the picture. This is a work in progress for us JVM folks...