A Day with PyPy
By John Rose on Nov 13, 2007
Yesterday, Charles Nutter and I played host to a roving band of PyPy researchers. Their aim is to get people interested in their new techniques in dynamic language compilation, and they succeeded with us. I had a special goal to figure out how the JVM might be useful in their world.
Here are some bits from the conversation that stick in my memory...
When dynamic language implementors say “C speed” they are talking about the speed of the original C-coded interpreters of their language. The original technology is almost always an AST-walking interpreter, which is about the slowest way to go about it. By contrast, when JVM implementors say “C speed”, they are comparing a Java program with a transliterated C or C++ program, compiled, optimized, and statically linked. Those two standards are separated by two or three orders of magnitude.
The PyPy idea of promotion includes capturing runtime values for use by the compiler, by lazily growing a PIC (polymorphic inline cache). A PIC is a dynamic call site (or other branch point) which directs control flow based on selectors (often class IDs), and it keeps a record of those selectors as they appear. PyPy characterizes this as a “growable switch”. As each new selector appears, the PIC grows another branch edge, to lazily linked code which is appropriate to that selector. This is more or less classic. (HotSpot relies on interpreter profiles more than PICs, but most implementations of the forthcoming invokedynamic instruction will probably create PICs.)
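The “growable switch” can be sketched in a few lines of Python. This is an illustrative model only, not PyPy's actual machinery: `GrowableSwitch` and `compile_for` are invented names, and real PICs grow machine-code branch edges rather than dictionary entries.

```python
# A minimal sketch of a PIC as a "growable switch": a dynamic call site
# that records each selector (here, a receiver class) it sees and lazily
# grows a new branch of specialized code for it.

class GrowableSwitch:
    def __init__(self, compile_for):
        self.cases = {}                 # selector (class) -> linked code
        self.compile_for = compile_for  # lazily builds code for a new selector

    def dispatch(self, receiver, *args):
        selector = type(receiver)
        code = self.cases.get(selector)
        if code is None:
            # Cache miss: grow the switch with a branch for this selector.
            code = self.compile_for(selector)
            self.cases[selector] = code
        return code(receiver, *args)

# Toy "compiler": specializes a size() operation per receiver class.
def compile_size(cls):
    if cls is list or cls is str:
        return lambda obj: len(obj)
    return lambda obj: 1

site = GrowableSwitch(compile_size)
site.dispatch([1, 2, 3])  # first list seen: grows a branch for list
site.dispatch("hello")    # first str seen: grows a second branch
```

The key property is laziness: the switch only ever contains branches for selectors that actually occurred at this site.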
Also, PyPy promotion involves lazy compilation, even at the basic block level. The code “appropriate” to a selector might be immediately linkable, or it might not yet exist. In the latter case the JIT creates the code, potentially in the full type and IR context of the caller, with the particular selector. That is how it is promoted: The JIT IR downstream of the switch can “see” the selector value as a compile-time constant.
Moreover, the generation does not stop with the end of the callee method (assuming the PIC was for a method call), but it continues with the next statement of the caller. In this way, the promoted selector value is available downstream of the call. The compilation is not just of the method call, but of the caller’s whole continuation from the point of the call. I think I want to call this kind of JIT a continuation JIT, because around here JITs do not use continuation passing style. (Other VMs do this—kudos.)
The cost of all this is frequent JIT runs, potential exponential code blow-up, and some trouble getting loops right. Another problem is the perennial trade-off between thoroughness and agility: The PyPy JIT runs most naturally on small pieces of code, with a correspondingly small window of context for the optimizer. To get global optimizations (such as loop transformation that HotSpot provides) PyPy code will need a second tier of recompilation, once the dynamic flow graph has settled down a bit. All in all, it is a very promising approach.
Since PyPy PICs do not return anywhere, the underlying platform really needs to supply a reliable tail-call optimization (TCO). We found some workarounds for the current set of JVMs, but the lack of TCO gets in the way of directly optimizing code produced by a continuation JIT... or a trace-based JIT.
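One classic workaround of the kind we discussed is a trampoline: instead of making a real tail call (which the JVM would push a frame for), code returns a thunk describing the next step, and a small driver loop bounces from thunk to thunk with a flat stack. This is a generic sketch, not the specific workaround we settled on.

```python
# A trampoline: the standard workaround on a platform without tail-call
# optimization. Tail calls are expressed as returned thunks, and a small
# driver loop runs them, so the stack never grows.

def trampoline(thunk):
    while callable(thunk):
        thunk = thunk()
    return thunk

def countdown(n):
    # A "tail call" expressed as a returned thunk, not a real call.
    if n == 0:
        return "done"
    return lambda: countdown(n - 1)

trampoline(countdown(100000))  # direct recursion this deep would overflow
```

The price is an allocation and an indirect call per hop, which is exactly why real TCO support in the platform is preferable for a continuation or trace JIT.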
PyPy promotion does not apply only to the class ID of a message receiver (as with classic PICs). A promoted switch selector can be a byte loaded from a bytecode stream, which is the way they partial-evaluate their interpreter loop into a templating JIT. A selector could also be composed from two or more receiver classes, which is how one could do CLOS-style multiple dispatch (for generic binary arithmetic, etc.).
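A composite selector for binary operations can be modeled as a dispatch table keyed on a pair of classes. The table and function names below are invented for illustration; a real implementation would grow this table lazily like a PIC.

```python
# A sketch of dispatch on a composite selector: the pair of classes of
# both operands, as in CLOS-style multiple dispatch for binary arithmetic.

from fractions import Fraction

def add_ints(a, b):
    return a + b

def add_mixed(a, b):
    return Fraction(a) + Fraction(b)

DISPATCH = {
    (int, int): add_ints,
    (int, Fraction): add_mixed,
    (Fraction, int): add_mixed,
    (Fraction, Fraction): add_mixed,
}

def generic_add(a, b):
    # The selector is composed from two receiver classes.
    return DISPATCH[type(a), type(b)](a, b)
```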
We talked about method replacement, especially the kind where a method is replaced by a better-optimized version. The invokedynamic instruction, combined with method handles, will provide some hooks for this. The HotSpot JVM can rewrite groups of stack frames (in both directions: optimizing and deoptimizing). We realized that first-class continuations in the JVM would allow Java code (in a dynamic language runtime) to express such transformations effectively. (Not surprising in hindsight, since my prototype version of call/cc on the JVM uses vframes, which is the mechanism HotSpot uses for deoptimization.) First-class continuations could also help with stack inspection operations and coroutines.
We talked about class splitting, in two cases: boxed Integer and Dictionary. In the latter case, you want a special system-designed subclass for (say) string keys only. In the former case, you want a special system-supported representation for small integer values which could be packed into an immediate field in a tagged pseudo-pointer.
The intermediate representation is a graph parsed from an RPython (restricted Python) program. RPython is a true subset of Python, but one which avoids the really dynamic features. To a first approximation, it is the part of Python with Java semantics: It can be statically typed. Another restriction is that its world of types is closed, so the compiler can enumerate the type hierarchy statically. (So-called application classes, loaded dynamically, need more metadata, an extra level of indirection away.)
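The flavor of the restriction can be shown with a pair of functions. Both are legal Python, but only the first is plausible RPython; I have not run these through the actual RPython toolchain, so treat them as an illustration of the principle rather than a test of the tool.

```python
# A sketch of the static-typing restriction RPython imposes: every
# variable must keep a single consistent type, so the translator can
# assign it a static type.

def rpython_ok(n):
    total = 0              # total is always an int: statically typeable
    for i in range(n):
        total += i
    return total

def rpython_rejected(flag):
    x = 0                  # x starts as an int...
    if flag:
        x = "zero"         # ...but is rebound to a str: too dynamic,
    return x               # the translator cannot give x one static type
```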
As in any fully dynamic language, the RPython optimizer performs scalarization pretty eagerly. (Or if you like, it performs boxing lazily and unwillingly.) This is needed for unboxing the normal Python integer type, and also for decomposing things like argument lists. It doesn’t work so well (yet) outside the closed world of RPython.
Why does RPython make a good intermediate language? Partly because it is block-structured, typeable, easy to program in, has nice types like lists, and is a true subset of a rich parent language (Python). So far, Java is like that also. Probably Python would support little languages for specialized problem spaces like pattern matching or template generation, but perhaps the PyPy people feel that would be a programming luxury they cannot afford right now.
The main edge RPython has over Java as an intermediate language is its complete dynamic typing, which allows the same code to work on compile-time and run-time representations of objects. (It sounded like abstract interpretation to me, but they say it is more a species of partial evaluation.) There is a compile-time object space which mirrors the regular (R)Python object space.
I would think that doing abstract interpretation over compile-time booleans would work best when the if/then/else statements, etc., are represented with functions and closures. Then, as Yogi Berra said, “When you come to a fork in the road, take it.” A side effect of each branch would be generating the IR for that branch, with the boolean promoted to a constant. One of the trade-offs in RPython, though, is to eliminate closures. This simplifies variable analysis.
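Taking both forks can be sketched by representing the conditional as a function of two closures, each of which emits IR as a side effect with the boolean promoted to a constant. The IR tuples and names here are invented for illustration; this is my reading of the idea, not PyPy's implementation.

```python
# A sketch of "taking the fork": if/then/else as a function of two
# closures. A compile-time pass runs both arms, and each arm emits IR
# that sees the condition promoted to a constant (True in one arm,
# False in the other).

def compile_if(cond_var, then_branch, else_branch, emit):
    emit(("branch_on", cond_var))
    then_branch(True, emit)    # this arm sees the constant True
    else_branch(False, emit)   # this arm sees the constant False

ir = []
compile_if(
    "x_is_small",
    lambda const, emit: emit(("fast_path", const)),
    lambda const, emit: emit(("slow_path", const)),
    ir.append,
)
```

Eliminating closures, as RPython does, forces this pattern to be expressed some other way, which is exactly the trade-off noted above.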
They were excited to hear about our nascent multi-language VM project. I expect we can collaborate there on some experiments with
All in all, it was an interesting and enjoyable day.