tuples in the VM

Or, juggling with more than one ball.

Introduction

For several years there have been ideas afloat for marking Java classes as "value-like", and treating them like (named) structs. This seems to be a necessary part of an initiative in Java to support types like Complex necessary to numerical programming. There have also been proposals for pure "tuple" types (e.g., ordered pairs (T,U) for any types T, U). This note proposes some incremental changes to the present JVM architecture which would support value-like classes and tuples.

The basic semantics of value classes in Java can be summed up as (a) a marker of some sort that distinguishes value-like classes from ordinary (identity-like) classes, and (b) reinterpretation of all operations in the Java language to ignore the identity of instances of the class, and instead operate on the field values as a group. Such a group of field values can be modeled as (compiled to) a tuple. In order to support name equivalence, the tuple needs to be joined (at compile time) by a "tag" which distinguishes it from similar tuples representing differently-named classes.

JVM Support for Tagged Tuples

We propose a tagged tuple signature as a new kind of signature string, consisting of a class signature followed by a series of zero or more arbitrary type signatures (including nested tagged tuples). Every tagged tuple signature is surrounded by curly braces '{}' which unambiguously separate it from any nearby signature parts.

  signature =
      ONEOF "ZBCHIFJD"
    | 'L' classname ';' | '[' signature
    | tagged_tuple | plain_tuple
  tagged_tuple = '{L' classname ';' signature\* '}'
  plain_tuple = '{V' signature\* '}'

A tagged tuple represents a single value of the given class (which might be a value class), represented as a series of typed values grouped in a "tuple".

A plain tuple is like a tagged tuple, except that there is no value class mentioned at the head of the signature; the character 'V' holds its place.

The character 'V' cannot occur within any type signature except at the head.

Like Java 1.5 generic type signatures, tagged tuple signatures have an "erasure" which removes type structure and normalizes down to a JVM 1.1 standard signature. The erasure of a tuple (of either type) omits the curly braces and the type signature component immediately after the open brace.

   ERASE "{Ljava/math/Complex;DD}"  ==  "DD"

Unlike Java 1.5 generic type signatures, tuple markings are significant to the linker and verifier. Two methods with the same class, name, and erased signature are different if their unerased signatures differ. Methods of distinct unerased signatures cannot link to each other, simply because their signatures differ, as Utf8 strings.

This significance means that tools which manipulate method and field descriptors will need to be updated. The cost of such tool updates is probably the major objection to an enhancement of the basic signature language (as opposed to the creation of a second language, as in generics). It seems that any scheme that supports multiple-value returns will always require tool updates, so it may not be possible to move the enhanced signatures to another attribute, as was done with generics.

Method Calls

The method invocation bytecodes ('invokestatic', etc.) treat tuple types specially. When applied to a method whose signature contains tuples, an invoke performs the same motions of stack data as it would if calling a method with the erased signature. For example, the following two methods have different signatures but the same calling sequence:

  invokestatic   pow   ({Ljava/math/Complex;DD}I)Ljava/math/Number;
  invokestatic   pow   (DDI)Ljava/math/Number;

In both cases, two doubles and an int are popped from the caller stack, and the returned reference (of type Number, heap-allocated) is pushed on the stack. (Note that Complex could also be returned on the heap. The VM is neutral as to stack or heap allocation policies of value classes.)

If the erasure leaves an empty string for the method return signature, the string "V" is used instead. If the erasure leaves more than one type signature for the return, the caller receives multiple values on the stack. (A new 'vreturn' bytecode is required to generate these values.)

Method Entry and Exit

Likewise, when a method with a tuple-bearing signature is entered, the values stored in the locals are of the same placement and type as if the method's signature were erased of tuples.

If the method's return signature erases to the null string, the 'return' bytecode must be used to exit the method. if the method's return signature erases to a single signature, that signature (which cannot be 'V') determines the type of return bytecode required. If the method's return signature erases to two or more type signatures, the method must use the 'vreturn' bytecode to exit.

The 'vreturn' bytecode is always a valid exit instruction, for any method whatsoever. It operates as follows:

  • the returning method's return signature is inspected
  • the return signature is erased of tuples
  • if the return signature is 'V', it is taken to be empty

Finally, for every individual value signature in the resulting signature, in right-to-left order, a corresponding typed value is popped from the method's stack. These values are transmitted back to the callee, and pushed onto the top of the callee stack in the same order that they were stored on the method's stack.

Note that 'vreturn' potentially renders all other return instructions obsolete.

Object Field Layout

The following is an optional feature. It can be simulated by emitting multiple scalar fields for each tuple field in a class definition. Supporting this feature will complicate some reflective operations.

The binary representation of a class can contain a field with a tuple signature. The effect of such a signature is to allocate multiple variables for the field, one for each type signature in the erasure of the signature. These variables are stored at JVM-dependent locations; there is no guarantee that they occupy contiguous storage, though they might in some VM implementations.

The bytecodes allow these variables to be read either as a group or as individual scalar components.

(The simulation of this with standard JVM fields would replace a field 'phase' of signature '{Ljava/math/Complex;DD}' by fields named 'phase.0' and 'phase.1' with signature 'D'. In order to preserve type checking of tuple tags, the signature must be decorated by appending the tuple type itself, yielding 'D{Ljava/math/Complex;DD}' or 'D{Ljava/math/Complex;DD}'.)

Scalar Field Read and Write

The getter and setter bytecodes ('getfield', 'putstatic', etc.) continue to operate on single values, either of primitive or reference type. When applied to a field with a tuple signature, the field signature must contain an additional decimal numeral. This numeral N selects the field variable corresponding to the Nth value (zero-origin) in the erasure of the field's tuple type. (Note that nested tuple signatures flatten into a single list of primitive and reference types.) The signature of the bytecoded reference (minus the numeric suffix) must agree with the signature of the field itself; the verifier or linker checks this.

Thus, the following field references access the two double variables of a complex field of signature '{Ljava/math/Complex;DD}': getfield phase {Ljava/math/Complex;DD}0 getfield phase {Ljava/math/Complex;DD}1

Tuple Field Read and Write

The following is an optional, complex feature.

There is a new instruction prefix 'vaccess' which may precede any of the four getter or setter bytecodes. The effect is to force the assignment to operate on all the variables of a field, pushing onto or popping them off of the stack. This prefix is provided as an abbreviation, and a bytecode prefixed by 'vaccess' is precisely equivalent to a similar series of scalar accesses, except that the field's unerased type must match the bytecoded reference signature. The order of the equivalent series of accesses is (as one might expect) in order from zero.

The bytecoded reference signature may be either erased or an unerased tuple. In the former case, the signature is compared against the erased type of the field. In the latter, the signature must match exactly.

Thus, the two field references of the previous example would be equivalent to the following single reference:

  vaccess getfield    phase   DD

All previous accesses would also be valid for a field of type '{LPoint;DD}'. However, an access with the unerased signature would signal an error for tuple field not specifically marked as Complex: vaccess getfield phase {Ljava/math/Complex;DD}

Tuple Data Transfer

There are also tuple-related variants 'vload' and 'vstore' of the local variable access bytecodes. They take a signature constant pool reference in the bytecode stream, before the local number (which is always in 2-byte width). They transfer a number of typed values between stack and locals; the precise sequence of values is determined by the signature named by the constant pool reference, which must be a CONSTANT_Utf8 string.

These bytecodes are provided as an abbreviation, and can be exactly simulated by a series of scalar loads or stores.

JVM Neutrality

It is important to note that the JVM makes no distinction between value and non-value classes. Most processing of tuple type signatures depends only on the erasure of those signatures. The chief exception to this rule is that method and field linkage requires an exact match of unerased signatures. However, all calling sequences and object layouts are driven by erased signatures.

The JVM makes no restriction on simultaneous manipulation of both heap-allocated instances of classes and their tuple representations. It is likely that the language will contemplate conversion between these representations but the JVM has no specific support for them.

The JVM allows bytecoded operations which maybe inappropriate to value objects. There is no need to reinterpret these operations for value classes; it is enough for the compiler to refrain from emitting them. For example, reference comparison bytecodes are probably useless with value objects, if their semantics are intended to be free of object identity. But there is no need for 'acmp' bytecodes to call Object.equals.

Java Reflection

The Java Reflection API works mainly at the language level, and so would follow whatever conventions (if any) were settled for processing of tuples in standard Java. Historically, much of reflection has been implemented in side the JVM, but implementation of reflective access to tuples would probably be implemented wholly in Java, on top of low-level unsafe operations and/or dynamically generated adapter classes.

It is likely that the tuple-enhanced language will provide a canonical translation between groups of typed values and references to heap objects carrying those values. For example, each value class is likely to support a constructor whose argument signature is the tagged tuple for that class. A tuple is likely to be representable on the heap as a fixed-length Object array or as a generic immutable utility class, with primitive component values in wrapper classes (as in varargs).

The JVM per se does not need to decide such matters, but the reflection API must follow the lead of the language, and properly convert raw tuples into the appropriate boxed representation, when passing values into and out of methods, or when loading and storing field values. (Recall that the Java Reflection API uses 'Object' as a universal wrapper for values of all types; this should extend to both tagged and plain tuple types.)

Any low-level unsafe API for making field and method references is likely to ignore tuples, and operate only on erased types. The 'field offset' query functions of sun.misc.Unsafe must be extended to report a distinct offset for each scalar component of a tuple-typed field; this requires a variant for each query functions which takes an integer parameter selecting the Nth tuple component of a field.

Notes on Compiling Java with Values and Tuples

The javac compiler should strive to keep value objects and tuples in "exploded" form and only box them in the heap when a conversion requires it.

There will be occasions where the compiler will want to keep track of both representations at once, to avoid repeated conversions.

An assignment to a pure tuple component is likely to require the compiler to forget a cached box.

Sometimes a boxed value or tuple will "escape"; for example, it might be the argument to List.add(Object). If the relevant value is a purely final value class, the compiler may continue (as a matter of arbitrary choice) to cache the box and reuse it later. Otherwise, it must "give up" the escaped box, and reconstruct a new one if a further coercion is required. The reason for this is that an escaped value object with non-final fields is subject to mutation.

Value objects with final fields and those with non-final fields are both reasonable. To preserve value semantics, their usage patterns will differ, since the identity of mutable objects is observable in more ways than the identity of immutable objects, and (one presumes) any language design will strongly insulate the programmer from perceiving the identities of boxed value objects.

Mutable value objects will have the possibility of aliasing, leading to the observation of unintentionally shared changes. The language should work hard to reduce or even eliminate aliasing, by defensively copying tuple values carried by boxes. For example, this code might end up reallocating:

    List<Complex> l1, l2;
    Complex c = l1.get(0), c2 = c;
    c2.imag = 0;
    assert(c != c2);  // no sharing here
    l2.add(c);  // defensive copy necessary

On the other hand, the following code might mutate a box in place:

    l1.get(0).imag = 0;

This is only reasonable if there is some theory under which List elements do not alias. This complicates generic algorithms like Collections.nCopies, which freely replicates references to list components. (If mutable value objects are tagged by an interface which includes a "defensive copy" operator, the generic algorithms can check for this.)

On the other hand, componentwise assignment to immutable types makes sense, as a shorthand for reassigning the whole tuple with one differing component. Suppose Complex is immutable; then the last two lines are equivalent:

    Complex k = Complex(3,4);
    k.imag = 3;
    k = Complex(3,4);

A Bonus Signature Hack: Setter Methods

The desired to do componentwise update on immutable objects immediately leads to a requirement for more formal correspondence between getter and setter functions, so that the following equivalence can work:

   l1.get(0).imag = 0;
   // short for and same effect as:
   Complex tem = l1.get(0); tem.imag = 0; l1.set(0, tem);

The challenge here is that the correspondence between methods like List.get and List.set is documented only in human-readable form, or perhaps in some framework built on top of the language (such as Java Beans). In order to define "in place update" of values obtained via an access method, there must be a more formal correspondence established.

This correspondence can be done in the JVM with signatures also:

   get   (I)Ljava/lang/Object;
   get   (I=Ljava/lang/Object;)V
   set   (ILjava/lang/Object;)V

The latter two method descriptors would be (for List) made equivalent by some as yet unspecified means, and classes would be allowed to declare setter methods that are formally coupled to getter methods.

Once again, the JVM could largely ignore the extra decoration by using an erasure convention (which would eliminate the '=' in this case). The presence of the extra character would simply disambiguate the setter from some other getter.

The language needs to supply a way to declare these setter methods, something like this:

   void get(int i, = Object x) { this.set(i, x); }
   void foo() { this.get(23) = "value #23"; }

Setter methods for immutable objects would need to return the updated version of self:

   BigInteger testBit(int n, = boolean z)
     { return z ? setBit(n) : clearBit(n); }
   void foo(BigInteger b) { b.testBit(23) = true; }

A alternative (less clean) technique would be to decorate the method name itself, like this:

   void 'get='(int i, Object x) { this.set(i, x); }

There is a significant retrofitting problem here (but not an unsolvable one) with all old Java APIs that feature getters and setters. A method aliasing facility, based on optional class file attributes, might be helpful.

(Note: This proposal dates from 8/2004.)

Comments:

A very interesting post. I submitted a RFE that proposed immutable types for Java:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6389769

And have implemented them in my pet project, Pattern Enforcing Compiler (PEC). PEC is a compiler that extends the type system to understand design patterns. PEC is available here:

http://pec.dev.java.net/nonav/compile/index.html

Whilst the compiler does not at present in-line immutable types; it could and a future version may well do so. The design chosen was to in-line, stack allocate, immutables since the example:

Complex c1 = new Complex(1,2);
Complex c2 = c1;
c2.im = 0;
out.println(c1);

is confusing. Does it print (1,2) or (1,0)? If Complex is a normal class, it prints the later. If complex is in-lined, the former. C# suffers from this problem with its struct construct. If you only in-line immutables then the above example is a compile time error if Complex is immutable and hence no confusion is possible.

Posted by Howard Lovatt on July 19, 2007 at 03:13 PM PDT #

Do you plan some additional support for arrays, in particular allowing by-value arrays of composed types? The erased-tuples seems to be a very clever and useul way to implement most features of structs / lightweight objects with small impact on the toolchain, API compatibility, typesystem complexity etc. just like Tiger did for generics. But arrays are the icing on the cake. Many maths / physics / engineering / 3D hackers will tell you that efficient Complex objects are good, but efficient Complex[] are ten times better. You only cover collections, which will require boxing, that's no good. If I understood the proposal correctly, it might support value storage of homogeneous objects (perhaps with some additional tricks or limitations), e.g. a Complex has two doubkle fields so you could disguise a double[200] as a Complex[100], using each pair of doubles per Complex tuple. Heterogeneous objects (those which fields are not all the same type) wouldn't be supported, since we only have homogeneous arrays. If this is too hard to fix, even by-value arrays restricted to homogeneous tuples would be very valuable (even if non-orthogonal = ugly), because many use cases that benefit of that feature involve homogeneous data... for example in 3D graphics Complex, Point2D, Point3D, Triangle, etc.

Posted by Osvaldo Pinali doederlein on July 30, 2007 at 03:33 AM PDT #

Howard: Actually integrating value-like types into Java raises difficult design issues, as you point out clearly. My purpose for this post is to indicate that adding tuples to the JVM is fairly simple, and does not require solution of the language design problems.

Osvaldo: Thanks for the thoughts. I agree the proposal should be rounded out with heterogeneous array types. They would be accessed with 'vaload' and 'vastore' instructions, which (unlike 'vreturn') would need a c-pool reference to help the interpreter type the array without help from the verifier. But I want to get comfortable first with the design for the 'getfield' and 'putfield' instructions. Actually, given heterogeneous array types, the JVM should also support inlined arrays. (One per instance, final classes only.) The idea would be to allow both 'getfield' and 'vaload' instructions on the same object. This would let us do strings and other handle/body structures in one memory block--saving numerous indirections.

Posted by John Rose on July 30, 2007 at 12:42 PM PDT #

John: Interesting... you seem to be moving towards a VM architecture that relies more in "unsafe" (unmanaged) operations for either performance or features (in this case, working around the Java typesystem's limitations). This adds to sun.misc.Unsafe and a growing list of HotSpot intrinsics. This trend remembers me of NaturalBridge's BulletTrain static compiler: many aeons ago, they created some proprietary unsafe bytecodes (not usable from non-core classes) so that they could implement most of their runtime in an "unsafe Java", which is much better than C even if it's not as nice as pure Java. I always wondered why Sun never adopted this obvious idea, there's a lot of native code in the core (not to mention third-party libs) that is native only because it needs to perform something forbidden, not for performance or other reasons. Having said that, I wonder if this raw memory access for heterogenous arrays won't create a problem for optimization, since optimizers like to know the type of everything in the stack and heap. Same goes for precise GC. I guess you can produce a stackmap-like map for those arrays. Now, instead of a large number of v\* bytecodes, why not having a single 'value' bytecode prefix fro all bytecodes that need extended value versions? So we have 'value return', etc. The frequency of these bytecode pairs will be relatively small, so code bloat is minimal... we only have some ~50 unassigned opcodes, not a lot if you consider my hopes of \*properly\* supporting features like reified generics, tail calls, dynamic calls and others that will follow ;-) btw why not assigning some bytecodes (or fix other classfile specs) to properly support a few things that the current spec sucks in, like initialization of arrays and objects with trivial constructors (I'd want them all to be constant-pooled), use of Iterators (lots of bloat and slowness unless escape analysis-based stack allocation happens), etc.

Posted by Osvaldo Pinali Doederlein on July 31, 2007 at 09:49 AM PDT #

I like this proposal. It sounds as though it implements something that I wanted to see, which is the ability to return multiple values.

Especially in performance-critical code, I would like to be able to write the following code:

int a; int b;

(a,b) = obj.method();

There is no rule that says that a function should only return a single value, and this certainly shouldn't be something that the JVM forces on all languages.

Also it is worth noting the performance angle of your proposals.

In allowing multiple values to be returned on the stack, you are saving the creation and garbage collection of an object, which is no trivial feat. It can easily involve jumping around several different areas of memory, thus flushing other data out of L1 and L2 cache. The great thing about the stack is that it is a contiguous area of memory that is frequently used, and therefore not going to get flushed from the cache.

Another beneficial aspect is that you are allowing code to be compiled in a way where the compiler makes it clear to the JVM that the data is for that thread only. In multi-core/multi-processor systems, then this is quite relevant. Effectively we have a stack-based object.. which I'd love to be able to declare and use.

Thanks for the hard work to document an interesting and useful idea.

Posted by Neale on August 01, 2007 at 01:31 AM PDT #

Osvaldo: There's nothing unsafe or unmanaged about these proposals, since the VM still knows the type of every value it handles. The Unsafe class (and related ideas with native methods) is a completely distinct conversation. Yes, heterogeneous arrays would need oop-maps much like regular instances do. My, you have big dreams for the JVM... I like that. The idea of adding a prefix to "widen" affected opcodes is a good one; thanks--and there is already a 'wide' instructin prefix the VM can use. The affected instructions might be 'aastore', 'aaload', and 'areturn'.

Neale: Thanks. And the advantages of unboxed return values also apply to unboxed arguments. Really, arguments and return values are much more symmetrical than they appear to be in most languages. Returning only one value is like being able to toss several values in the air, but only being allowed to catch one--hence the byline of this posting.

Posted by John Rose on August 01, 2007 at 07:23 AM PDT #

Thanks John. I note your KSVM (Kitchen Sink VM) proposal on the hotspot-dev list, and agree... it's almost a necessity of the KSL idea as some language features will need VM support. Regarding unboxing, I think I see what you mean (not being one familar with all terms). Essentially, I take it that you're saying that:
Complex sum = Complex.add( complex1, complex2);
would pass complex1 and complex2 by value and return a complex by value... all on the stack. I can certainly see some perf benefits there.

Posted by Neale on August 01, 2007 at 08:43 PM PDT #

Hi John,

Just refactoring some code and realised the killer application for your changes.

If I'm able to have a function:

public int x, Object b doSomething(){
...
return 1, new Object();
}

as would be the case with your proposals, then it gets around one thing that I find a pain when trying to clean up poorly written code, which is the issue of there being more than one value modified a block of code where I'm attempting to extract a method.

While it's possibly not going to remain that way, having the language option does mean that it gives a refactoring stepping stone that can be automated and tested.

As it is... I'll get back to sorting out "someone else's code".

Cheers,

Neale

Posted by Neale on August 08, 2007 at 08:19 AM PDT #

This appears to have some overlap with proposals for BeanProperties. Perhaps there is a more general solution which can handle both cases.
As well as or instead of immutable objects I would suggest a copy-on-write objects.

Posted by Peter Lawrey on October 16, 2007 at 07:12 PM PDT #

Mmm. I can't believe how time has flown since last August.

Just picked up some code that encodes points on the earth's surface as a 3d vector.

When converting back to lat/lon, there are some trig functions that feature in both of those conversions, but I cannot write: (lat,lon) = point.getLatLon() when only able to 'catch one ball'.

My saviour here, if it were perf critical is to hope that both getLat() and getLon() would be inlined and the common trig functions would then get merged into a single call.

Roll on Tuples.. and the Java 8 language feature to allow:
public float,float getLatLon(){
...
return lat,lon;
}

Posted by Neale on June 20, 2008 at 09:04 PM PDT #

Hear is a library that adds tuple functionality:

http://code.google.com/p/google-collections/issues/detail?colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary&q=43&can=1&id=43

Your example would be:

public T2<Float, Float> getLatLon() {
...
return t(lat, lon);
}

There are advantages in the library approach over the language approach. In most language approaches you can't turn the tuple into a class at a latter date, with a library you can.

If Java inferred type from the RHS of a declaration then there would be little need for a language change since the syntax of the library version would become very short. It is already short due to the overloading of static method t in the library (see example above), but it could be even shorter.

Posted by Howard Lovatt on August 17, 2008 at 11:13 AM PDT #

Supporting values at the level of JVM is also important for implementing concept-oriented programming (COP) languages. In concept-oriented programming, concept is defined as a pair of one reference class and one object class, and then concepts are used instead of classes. A reference class is actually a value class while object class is a normal class. It is important that values and objects in COP are treated symmetrically as two sides of one element. Therefore, having no values in byte-code is a serious difficulty when implementing such a language. Here is my blog post about this issue: http://conceptoriented.org/blogs/cob/2009/05/12/values-vs-objects-in-cop

Posted by Alexandr Savinov on September 04, 2009 at 02:00 AM PDT #

all: Thank you for the comments and references.

I am working on an updated post on value types.

Posted by John Rose on November 10, 2009 at 08:02 AM PST #

Hi John, you mentioned an update on value types. Any news on that?

Posted by Ruediger Keller on April 29, 2010 at 01:17 AM PDT #

+1 on the post, also intested in an update. Found this post by means of reference on the Disruptor project - https://code.google.com/p/disruptor/

Posted by B. K. Oxley (binkley) on June 23, 2011 at 04:54 AM PDT #

Any chance we may see this in Java8?

Posted by guest on October 01, 2011 at 09:24 AM PDT #

Erasure is a bad idea, for example, I've tried to explain till I'm blue in the face, the problem of using Generic's in java.rmi.Remote interfaces, but all too often I get blank looks and arguments.

The problem is, code compiled separately, but linked dynamically at run time, although using a common interface is broken if that code has been subject to erasure, because it cannot be checked at runtime and has not been checked at compile time either.

Don't make the same mistake Sun did with Generics, increment the byte code version, developers have the option of compiling to an earlier version, it's no big deal. Even better fix Generics by eliminating erasure.

We're at an exciting time in Java's history, where we'll soon get to take advantage of immutability and functional programming idoms, lets not add complexity by using erasure.

In an ever more distributed world, we need to consider the distributed aspects of Java.

Posted by guest on June 12, 2013 at 05:21 PM PDT #

Post a Comment:
  • HTML Syntax: NOT allowed
About

John R. Rose

Java maven, HotSpot developer, Mac user, Scheme refugee.

Once Sun and present Oracle engineer.

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today