fixnums in the VM

Or, the headless object rides again.

Introduction

Dynamic languages typically perform arithmetic using dynamically typed references to boxed numbers. Normally (at least in a JVM) a dynamically typed reference is a pointer to an object in the heap, whose header in turn contains the required dynamic type information. But language-specific implementations often use pseudo-pointers to represent a commonly used subset of numbers, with the actual bits of the pseudo-pointer carrying a value field “payload”. Also, because their relaxed typing encourages data structure reuse, dynamic languages typically feature a few small but ever-present types like Lisp’s cons cell which are overwhelmingly common in the heap. Again, language specific implementations have historically provided “headerless” representations for these, in which the fields are stored in the heap but are not accompanied by a header.

The JVM can support fixnums and other headerless objects for the sake of these languages, and even for Java. The idea is to make ordinary object pointers (sometimes called oops) coexist with more specialized formats for headerless objects, which we will call iops and xops. The techniques are mature and well-known, and the overheads (of extra tag checking before pointer usage) can be controlled by optimizations already used widely in JVMs. In particular, the Hotspot JVM could cleanly represent fixnums and other immediate data types by modifying extending its oop type.

Background

In Hotspot, a typical mature JVM, an object header is two words, a mark and a klass. (That ‘k ’is not a typo, but a dodge to avoid a reserved keyword in C++.) The mark is used for bookkeeping of operations like synchronization and identity hash code. The klass is a full machine pointer, to a metadata object which describes the object’s layout, methods, and other type information. A dynamic typing operation must first load the klass and then query its contents. (The actual query logic may require one more dependent load, from the klass object.) This object layout is flexible and general, but it requires two machine words of overhead in every object, plus a possible word or so of alignment overhead, plus (of course) the payload of the object, i.e., the actual fields containing application information.

In many language-specific systems, integers of 20-30 bits or less are compressed into pseudo-pointers. Lisp and Ruby call them fixnums, while Smalltalk uses the name SmallInteger. Some languages, like Emacs Lisp, simply call them “integer” and do not allow overflow. Most dynamic languages check for arithmetic overflow beyond the 20-30 bits, and either signal an error or provide a more or less seamless transition from fixnum to bignum, i.e., from a machine word subrange to a multiple precision array-based representation.

Integer overheads

In the case of a Java Integer on a 32-bit Hotspot JVM, the 32-bit payload (a Integer.value field) is accompanied by a 96 additional bits, a mark, a klass, and a word of alignment padding, for a total of 128 bits. Moreover, if there are (say) six references to this integer in the world (threads plus heap), those references also occupy 192 bits, for a total of 320 bits. On a 64-bit machine, everything is twice as big, at least at present: 256 bits in the object (which now includes 96 bits of padding), and 384 bits elsewhere. By contrast, six copies of an unboxed primitive integer occupy 192 bits. The point of introducing fixnums is that the overhead of dynamically typed integers can be much closer to that minimum of 192 bits than it is now. (The number six is chosen arbitrarily to model a common use case, a number that has been boxed and then copied a few times by procedure calling or other computation.) Here is a summary of the sizes:

Representation Heap bits Reference bits (N=6) Total bits Comparison
primitive int 0 32 192 1.00
fixnum, 32-bit JVM 0 192 192 1.00
object, 32-bit JVM 128 192 320 1.67
fixnum, 64-bit JVM 0 384 384 2.00
object, 64-bit JVM 256 384 640 3.33

When Java introduced autoboxing of primitive ints into Integer objects, application codes began to perform implicit creation of integer objects, and the allocation rate went up. Of course it depends on the application, but codes which mix generic collections (like List<>) with primitive values routinely box and unbox frequently. The implicit boxing is done by the factory method Integer.valueOf(int). This method has been carefully specified to allow memoization or uniquification techniques, so that more references can share fewer copies of the same value.

In the case of a widely shared heap object, such as Integer.valueOf(13), the heap overhead is negligible, but there is still an execution overhead from loads of klass, klass internals, and Integer.value fields, with associated costs of cache traffic and occupancy. With fixnums, dynamically typed arithmetic can proceed with no memory references at all.

Note that in a 32-bit system compromises must be made in representing the 32-bit value of an integer. Historically, fixnums represent integers in widths like 30, 28, or 24 bits, leaving a few additional bits available to distinguish the fixnum from ordinary object pointers and other kinds of pointer-like values. In a 32-bit JVM, some calls to Integer.valueOf will not be able to return a fixnum; they will have to product ordinary objects. But in practice, no applications use all 32 bits of every integer; most integers are small values like array indices, and can be compressed into less than 32 bits. This is true (to a lesser extent) of longs, floats, and doubles. Smaller values like characters and booleans can be stored in full precision.

On a 64-bit JVM, there is so much “slack” in the pointer layout that the full 32 bit range of values can be represented in fixnum format. Even larger values like longs and doubles can fit into the pointer layout, with a little compression. But regardless of pointer size, some Integer object will always have to be ordinary objects, such as in strange edge cases where reference equality is significant, or when (wrongheadedly) a code synchronizes on a Integer object.

This reasoning can be applied, and the fixnum optimization constructed, for any JVM type with the following properties:

  • The object contains a payload of a machine word or less.
  • The payload is immutable (a final and/or private field, unmodified outside the constructor).
  • If the payload is a machine word, it is usually compressible (e.g, via sign truncation).
  • The type is heavily used, enough to justify the complexity of a headerless representation.
  • The object is never used for synchronization, or can be created in a locked state.

The last requirement (no synchronization) is necessary, since synchronization is a kind of object state that can potentially live in an object’s header, regardless of the immutability of its payload. Since the type Object itself supports synchronization, it is hard to avoid in the JVM. However, there are two observations here which can let us make progress:

  1. Objects which the user merely shares (and does not allocate) can behave as if someone else had locked them and will never unlock them.
  2. Types not already in the Java language do not need to support synchronization.

The first case covers Integer.valueOf and the other factory methods used by autoboxing. The second case could cover closures, method handles, tuples, complex numbers, etc. It will probably be a good idea to standardize on a marker of some sort which disables the locking feature for a given class (and its subclasses). A marker interface Unsynchronized is probably best, since synchronization is, for better or worse, a public operation.

Let’s call any pseudo-pointer format which carries its payload directly an immediate object pseudo-pointer, or iop. Below we will discuss more specific implementation tactics for iops.

Headerless pairs, tuples

A similar analysis can be made for small objects, objects for which the extra space occupied by an ordinary header is a significant overhead. fFor example, in some Lisp applications, the majority of the heap is occupied by small two-element tuples called cons cells, which Lisp uses to build lists and trees. Removing the JVM header from a cons cell can double the effectiveness of memory—and, more importantly, of cache.

(A further optimization called cdr coding could double memory density again, almost, by eliding next pointers if they immutably point to a following cons cell created at the same time. But this would require interior pointers on the JVM, a topic which must wait for another day.)

Although a headerless pair would be referred to via a pointer, it would not qualify as an ordinary object pointer, but rather as a specialized headerless pointer. Let’s call this an extraordinary object pointer, or xop. As with an iop, an xop must somehow be tagged to distinguish it from an ordinary object pointer. (Following Smalltalk tradition, Hotspot calls the latter an oop.) Along with the tagging, some means of extracting a klass value is needed. The klass could be bit-encoded into the reference as with an iop, especially on 64-bit systems. Since an xop refers to memory, the klass could also be stored in memory in such a way (e.g., a card header) that it is shared by several nearby objects of the same type. Or it could just be stored in an abbreviated object header (one-word, markless).

An extraordinary object pointer could be useful for any JVM type with the following properties:

  • The object contains a payload of a few machine words.
  • The payload need not be directly synchronized. (E.g., it is immutable, or does not need multiword transactions, or is single-threaded, or is locked by an external mutex.)
  • The type is heavily used, enough to justify the complexity of a headerless representation.
  • The object is never used for synchronization, or can be created in a locked state.

Types which might possibly benefit from this treatment include any immutable tuple type, method references or closures, Lisp cons cells, multipart numeric types like complex, or strings.

(Pronunciation Warning: In homage to Dr. Seuss, I intend to pronounce these neologisms monosyllabically as Yopps and Zopps.)

Dirty details

Pseudo-pointer format

In JVMs where this makes sense, such as Hotspot, one can divide the machine word of an oop into an implementation-defined set of three fields, tag, klass, and value. The sizes of these are tunable, but on 32-bit systems the tag is generally 2-3 bits, the klass 0-5 bits, and the value 24-31 bits. On 64-bit machines, the klass may in fact be a restricted pointer value (31 bits or so), with klass and tag sharing one 32-bit subword and the value in the other. For jumbo payloads, the value could be expanded to almost 64 bits, at the expense of klass and tag.

Tag: On most machines, the tag is in the high bits. On machines that require address alignment, the tag could be the low bits, some of which are non-zero (the traditional “low-tag” representation), though this could impose an extra check on unaligned operations. In all cases, tag bits are a fixed combination that is distinct from all valid heap pointer values and in addition should (if at all possible) cause a trap of some sort if it is used. This will enable certain favorable compilation tricks[1], exactly like the ones that JVMs use today for null pointer decoding.

Klass: On 32-bit systems, the klass must be an index into a table of well-known classes. On 64-bit systems it can be a more general pointer. But the generality would still limited to classes with single immutable value fields, and factory methods like Integer.valueOf! So the generality of a full klass pointer is only useful if it helps with performance.

The tag and klass fields can often be unified. For example, if the three low bits are used for an alignment-based tag, then a unified field (say, the low eight bits) can be used, where any klass value can be used as long as at least one of its three low bits is non-zero. Or, the tag field can be used to control the width of the klass field, allowing a tunable range of value widths.

Value: For integers and other values of 32 bits or less, the value field is obviously a clipped copy of what would ordinarily be stored in the value field of the class, if the reference were a regular object. With the low-tag representation, the value can be extracted in a single instruction (signed right shift). For floats, the value can be zero-filled to the right.

A straw-man design

Though I do not have time for it at present, if I were to try out this design space today in Hotspot, I might start hacking with the following specific parameters:

  • 32-bit JVM (more challenging than 64-bit, but a more immediately usable result)
  • 28-bit long immediate or 20-bit short immediate value field, left justified
  • 2-bit right-justified tag (supported directly by SPARC memory bus, requires explicit testing on Intel)
  • 2-bit tag values of 1 or 3 are reserved for iops
  • only iops have klass fields
  • if 2-bit tag is 1, the klass field is 2 bits and value is 28 bits
  • 28-bit long immediate value represents an int (later, perhaps long, float, and/or double)
  • if 2-bit tag is 3, the klass field is 10 bits and value is 20 bits
  • initial short immediate classes are void, byte, boolean, short, and char.
  • xops to be attempted only after iops are working
  • 2-bit tag of 2 reserved for xops
  • a klass pointer at the front of each xops, at value+4, and the first field at value+8
  • in a xop, the word at value+0 is part of the previous object; it overlays the mark field of any oop

Semantic features of fixnums and other iops

  • The call java.lang.Integer.valueOf((int)x) returns a fixnum if x is in the supported subrange.
  • All Integer methods just work as normal on this alternate representation. Internally, the JVM codes them as a tiny if/then/else, controlled by a tag check.
  • Ditto for all Object and Number methods (toString, etc.).
  • Regarding monitorenter, wait, etc.: All iops must behave (w.r.t. subsequent synchronization attempts) as if they were locked immediately upon creation, and are never unlocked.
  • Explicit construction expressions new Integer(x)) must continue to create ordinary objects.
  • Compiler escape analysis may sometimes be able to weaken explicitly allocated integers to iops.
  • The identity hash code of any iop is obtained by extracting as many of its value and klass bits as will fit in the 31-bit result.

A number of other classes might merit the iop treatment. Here is a list ordered by roughly decreasing plausbility. Note that there are two alternate representations for strings; they could co-exist with some care.

  • Byte, Short, Character, Boolean, Void
  • most enum instances
  • Long, Float, Double
  • very short strings (value packs 0-3 7-bit bytes or one 20-bit extended char)
  • well-known interned strings (value is index into global table of packed UTF8 sequences)
  • some zero-length arrays or strings (note: this should combine the length check with a tag check)

Semantic features of cons cells, tuples, and other xops

Because of the interactions with object identity, extraordinary object pointers fall into two categories: Immutable and mutable.

In the immutable case, user-level semantics are simple, while the JVM maintains the fiction that object identity and structure equality are the same. We will use the complex number (a 2-tuple of doubles) as an example:

  • The complex type is defined from the start as unsynchronized, so that the JVM is free to omit headers uniformly.
  • The expression new Complex(x,y) always produces a headerless object, as does the underlying newinstance instruction.
  • All Object methods just work as normal on this headerless representation. Internally, the JVM codes them as a tiny if/then/else, controlled by a tag check.
  • Re. monitorenter, etc.: Complex numbers, like all other xops and iops, must behave as if they are “locked at birth”.
  • The acmp instructions (Java’s reference equality comparisons) are illegal when applied to two complex numbers (or any two xops). Enforcing this requires extra bit testing, mitigated by some type analysis in the compiler. As an exception, the code of the complex class can use reference equality comparison without restriction on the self type. (That is, the operator acts like a protected method.)
  • The identity hash code of complex number is computed by calling its hashCode method.
  • The equals and hashCode methods from Object must be overridden by Complex.
  • The JVM is free to clone and unclone equivalent complex number instances, in the GC and/or on the fly, without notifying the application.

Note that, under these rules and those of Java, objects produced by Integer.valueOf can be represented as immutable xops when they do not fit in iops.

The semantics of mutable xops are somewhat less regular, and so the concept may not be as valuable. We will use the (mutable) cons cell as an example. Here are the points where they differ from their immutable cousins:

  • The identity hash code of a mutable cons cell is -1. This lets identity-sensitive hash tables detect unsuitable keys quickly. (Or should System.identityHashCode throw an error?)
  • As with regular oops, the system is not free to clone or unclone mutable cons cells.
  • Reference equality works the same as with oops.

Implementation consequences

  • Some types implemented with non-oops may be invisibly split by the JVM into ordinary and immediate or extraordinary formats. Additional type inference and tag testing is required wherever such splits can occur, or where super types like Object may disguise the non-oops.
  • A getfield instruction for Integer.value (etc.) must first test the tag, and then issue either a load or a shift to obtain the value field.
  • Bytecodes which access the object header must also test the tag. (These include monitorenter, checkcast, instanceof instructions.)
  • Invocations of Object (and Number) methods must also test the tag. Access to unrelated classes (e.g., String) will be unaffected.
  • Compilers can use the usual combination of static analysis and profile-driven optimism to make the overheads go away.
  • If bus alignment is being used to check low-tags, and a super class has a field of size N (e.g., a Object.klass header field of size 4), be sure that any possible non-oops have low bits that are misaligned modulo N, or else check the bits explicitly. This could creep up, say with an attempted memory access to the Byte.value field via a iop that happened to look like an address.

Conclusion

We have examined low-level techniques for headerless objects, and investigated how to integrate them into the JVM and even (in some cases) the Java language. The techniques are on a sliding scale from the straightforward (low tag bit for immediate Integer only) to the subtle (compile-time and profile-driven type estimation). Implementing even some of this in an existing JVM is likely to be an adventure on the order of, say, compressing oops. A language implementor might prefer, at first, to try these techniques in the closed environment of a from-scratch, specialized VM. But it is probably more profitable in the end to fit this these techniques into an optimized, mature JIT and GC, such as Hotspot... Especially now that Hotspot is part of the Open JDK, and even has a Multi-Language VM subproject specifically for these explorations!


Footnote [1] ...certain favorable compilation tricks... What compilation tricks are those, you ask? Well, most pointer indirection instructions in a JVM will never provoke null exceptions. Yet the JVM is obligated to always defend against possible nulls. Some instructions can statically be proven never to see null, while many others cannot be so proven. The latter are compiled optimistically, so that if a null ever does come through, the instruction will cause some sort of hardware-level trap. In that case the null check comes for free with the CPU address decode logic. The recovery from a failure of such a check is very complex and painful, but complete; the result is that the newly failing instruction is emulated somehow in the interpreter or by other cleanup code. It is OK for that recovery to be very expensive, especially if the JVM recompiles the offending instruction to be guarded by an explicit null check. The JVM keeps a historical profile about exceptional events, which the compiler consults; it optimistically uses an implicit null check when the profile looks clear, and the system backs off to a slightly more expensive explicit check when that fails. The same thing can be done with tag bits, if they look ugly enough to make the CPU throw a trap.

Bug Reference: 6674617, dynamic languages need more efficient autoboxing

Comments:

Regarding the problem with synchronization on arbitrary Objects:

Shouldn't it be possible for the JVM to decide statically, which classes are ever used to synchronize on?

I'd guess that practically no-one actually synchronizes on plain Objects (or if they do, then only on one certain instance they use as a monitor), Integers, Strings, ...

It might be a worthwhile optimization for the VM to detect which classes actually need synchronization and omit the bookkeeping information needed for that from most objects.

Of course de-optimizing this kind of thing in the face of code loading might be uh, tricky :-)

Posted by Martin Probst on March 12, 2008 at 07:48 PM PDT #

Another excellent article, and on a topic I really wanted to hear you talk about. (Asking about this on your google group is\^Wwas on my to-do list.)

Any chance you can get that bug's title changed? As you mention in your article, it's not just dynamic languages that would benefit here: static languages with autoboxing and/or a Smalltalk-like blurring of the distinction between machine integers and greater/arbitrary precision integers benefit too. Most importantly (as you say), Java belongs to the first class.

Sadly, I can't see the linked-to bug yet, or I'd vote for it.

If you'd like to write the other John Rose article I'd love to read (other than the one that says "I've just committed a 'fixnum' implementation to Hotspot", that is), how about:

You know what I'd love? Someone like Sun's John Rose to go over the JVMS "Compiling for the Java Virtual Machine" chapter and tell us what kind of code we should be generating for the modern HotSpot JVM. Which idioms and bytecode-level optimizations are good, and which idioms we should avoid. Especially balanced against issues like code size, verifiability, and future-proofing.
(from http://elliotth.blogspot.com/2008/03/generating-jvm-bytecode.html )

Posted by Elliott Hughes on March 13, 2008 at 04:08 AM PDT #

Martin: In a strict sense, JVMs can do little statically about Object methods, since any reference can lose its static typing by casting to Object. (I count identityHashCode and monitorenter as Object methods.) But you mean an optimistic (quasi-static) optimization in which some classes assume (pending further evidence) that they will not be treated as full objects. The key question is always the deoptimization cost. In this case, it looks like a full GC could be required to reformat the affected object(s). It's the Smalltalk "become" primitive, the nasty version which can change an object's size and layout. I don't think I'm brave enough to make these transformations automagically. Maybe they would be justified in the presence of assurances from the user.

Elliott: I have some good news for you about code quality investigations. I have a change out for review[1] on OpenJDK which lets you disassemble the JIT's output. You probably already know about the LogCompilation option[2], which helps you read the JIT's mind. As far as listing the relevant optimizations, the problem is that the people who are up to speed on the JIT are at this point too few and harried to sit down and write up everything they know. (I'm trying to do my part.[3] I find the future easier to write about than the past.) I expect that in a couple of years a lot of people outside of Sun will be working on the Hotspot JIT, and I hope they will be blogging their learning curve as they go.

[1] http://webrev.invokedynamic.info/jrose/6667042
[2] http://groups.google.com/group/jvm-languages/browse_thread/thread/2b92e7d42f7a2f63
[3] http://wikis.sun.com/display/HotSpotInternals/PerformanceTechniques

Posted by John Rose on March 13, 2008 at 08:21 AM PDT #

Very interesting article. An alternative is to mark final, immutable, and non-nullable classes that could therefore be stack allocated with an annotation or interface, e.g.:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4617197

Then the JVM can decide to stack allocate these objects.

Posted by Howard Lovatt on March 15, 2008 at 09:29 PM PDT #

Thanks, that's interesting. (It would be awesome if we could get an up-to-date JVM spec on the web, though, instead of the current 2e + random uncollected corrections and amendments.)

One specific note about the wiki PerformanceTechniques page: I'm curious about the claim that static final fields are fast. That's not my experience at all. I have a random benchmark that's mostly double arithmetic. Take the baseline performance as using Ljava.lang.Double; (actually a different boxing class implementing a "numeric" interface, but similar) and no optimizations. I can save 20% by using invokestatic instead of invokeinterface. I can save 80% by using D instead of Ljava.lang.Double;, but that's still 5x slower than the equivalent program in Java compiled by javac. Changing two constants used in the innermost loop from "static" to "static final" fields makes no obvious difference. Changing them from fields to locals gives me Java-equivalent performance. or, if I keep the boxing, just this tiny change knocks 80% off the baseline; it's actually cheaper to box these would-be constants each time round my innermost loop than to getstatic them.

This seems to be true of the client and server compilers alike.

Am I missing something? Doing something wrong? Or is there some subtlety to the statement that I missed? I've looked at diffs between the two generated classes, and other than aload instead of getstatic, the only differences are that the fast version uses more local slots, some of the locals have higher indexes (obviously), and the code generated for the constants is now in the innermost loop. All those seem like they'd be bad, if anything. And yet the aload version takes 20% as long as the getstatic version.

(Full details/source/generated code available on request, but I'm assuming you'll be able to answer this in one sentence anyway, with a "typo" or "you're forgetting $whatever".)

Posted by Elliott Hughes on March 23, 2008 at 01:06 PM PDT #

To answer my own question:

http://elliotth.blogspot.com/2008/03/generating-jvm-bytecode-2.html

I don't know \*why\* the messy stack had this effect, but I do know that was the problem.

Posted by Elliott Hughes on April 05, 2008 at 03:44 PM PDT #

Post a Comment:
  • HTML Syntax: NOT allowed
About

John R. Rose

Java maven, HotSpot developer, Mac user, Scheme refugee.

Once Sun and present Oracle engineer.

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today