symbolic freedom in the VM

Or, watching your language with dangerous characters.

Introduction

The JVM uses symbolic names to link together the many hundreds of classes that make up an application.  Like source code in the Java programming language, these symbols fall into a small number of categories:  Class, package, method, field, type signature.  Unlike source code, symbols in the JVM are represented in a uniform syntax, a counted sequence of Unicode characters.  Since this same format is used in class files to represent String literals, and since Strings are arbitrary character sequences, perhaps the JVM can readily accept class, package, method, field, and type names which could be any string, not just the strings accepted by the Java compiler.  Let's call such names “exotic names”.

The JVM originally inherited symbol spelling restrictions from the Java language, but in recent years it has removed most restrictions. This note describes how to remove the remaining restrictions, by presenting a universal mangling convention to encode arbitrary spelling strings into a form which is permitted by the JVM. This mangling is easy for humans to read and for machines to decode.

The motivation for this is from non-Java languages, which have their own rules for composing symbols of types, variables, and so on.  Some languages, like Common Lisp, allow any string whatever (even the empty string!) to spell a symbol name.  Languages with operator redefinition absolutely require a way to process symbols like “+” (a single plus sign).  When such a language meets the JVM, language-specific names must either be kept completely separate from JVM names, or the language’s bytecode compiler must use some sort of name mangling to keep the JVM from panicking.

Quick Start

For those who prefer to guess rationale from bald facts, here are the encoding rules in tabular form:

Dangerous Character   Why Dangerous                                             Where Illegal   Escape Sequence
/  002F               delimits a package prefix in a class name                 any name        \|  005C 007C
.  002E               looks like a package prefix                               any name        \,  005C 002C
;  003B               delimits a type within a field or method signature        any name        \?  005C 003F
$  0024               looks like a nested class name or synthetic member        nowhere         \%  005C 0025
<  003C               looks like <init>, delimiter in generic type signature    method name     \^  005C 005E
>  003E               looks like <init>                                         method name     \_  005C 005F
[  005B               begins the name of an array class                         class name      \{  005C 007B
]  005D               not dangerous, but goes with [; reserved                  nowhere         \}  005C 007D
:  003A               not dangerous, but reserved for language use              nowhere         \!  005C 0021
\  005C               not dangerous, except when forming an accidental escape   nowhere         \-  005C 002D
(null string)         bytecode names must be non-empty                          any name        \=  005C 003D

Avoiding Dangerous Characters

The JVM defines a very small set of characters which are illegal in name spellings. We will slightly extend and regularize this set into a group of dangerous characters. These characters will then be replaced, in mangled names, by escape sequences. In addition, accidental escape sequences must be further escaped. Finally, a special prefix will be applied if and only if the mangling would otherwise fail to begin with the escape character. This happens to cover the corner case of the null string, and also clearly marks symbols which need demangling.

Dangerous characters are the union of all characters forbidden or otherwise restricted by the JVM specification, plus their mates, if they are brackets ([ and ], < and >), plus, arbitrarily, the colon character :. There is no distinction between type, method, and field names. This makes it easier to convert between mangled names of different types, since they do not need to be decoded (demangled).

The escape character is backslash \ (also known as reverse solidus). This character is, until now, unheard of in bytecode names, but traditional in the proposed role.

Replacement Characters

Every escape sequence is two characters (in fact, two UTF8 bytes) beginning with the escape character and followed by a replacement character. (Since the replacement character is never a backslash, iterated manglings do not double in size.)

Each dangerous character has some rough visual similarity to its corresponding replacement character. This makes mangled symbols easier to recognize by sight.

The dangerous characters are / (forward slash, used to delimit package components), . (dot, also a package delimiter), ; (semicolon, used in signatures), $ (dollar, used in inner classes and synthetic members), < (left angle), > (right angle), [ (left square bracket, used in array types), ] (right square bracket, reserved in this scheme for language use), and : (colon, reserved in this scheme for language use). Their replacements are, respectively, | (vertical bar), , (comma), ? (question mark), % (percent), ^ (caret), _ (underscore), { (left curly bracket), } (right curly bracket), and ! (exclamation mark). In addition, the replacement character for the escape character itself is - (hyphen), and the replacement character for the null prefix is = (equal sign).

An escape character \ followed by any of these replacement characters is an escape sequence, and there are no other escape sequences. An equal sign is only part of an escape sequence if it is the second character in the whole string, following a backslash. Two consecutive backslashes do not form an escape sequence.
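The definition above can be sketched as a small scanner. This is only an illustration: the class name EscapeScan and the methods isEscapeAt and escapeStarts are invented for this example and are not taken from StringNames.java or the OpenJDK code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any published library.
final class EscapeScan {
    // Replacement characters from the table. '=' is handled separately,
    // because it only counts in a null prefix at the very start.
    static final String REPLACEMENTS = "|,?%^_{}!-";

    /** True if an escape sequence starts at index i of s. */
    static boolean isEscapeAt(String s, int i) {
        if (s.charAt(i) != '\\' || i + 1 >= s.length()) return false;
        char next = s.charAt(i + 1);
        // Two consecutive backslashes fall through to false here.
        return REPLACEMENTS.indexOf(next) >= 0 || (i == 0 && next == '=');
    }

    /** Start indices of every escape sequence in s, found in one pass. */
    static List<Integer> escapeStarts(String s) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            if (isEscapeAt(s, i)) {
                starts.add(i);
                i++; // skip the replacement character; escapes cannot overlap
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(escapeStarts("\\=a\\|")); // [0, 3]
        System.out.println(escapeStarts("a\\=b"));   // []
        System.out.println(escapeStarts("\\\\-"));   // [1]
    }
}
```

The last call shows the double-backslash rule: in the three-character string \\- the first backslash is ordinary, and only the second one starts an escape.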

Each escape sequence replaces a so-called original character which is either one of the dangerous characters or the escape character. A null prefix replaces an initial null string, not a character.

All this implies that escape sequences cannot overlap and may be determined all at once for a whole string. Note that a spelling string can contain accidental escapes, apparent escape sequences which must not be interpreted as manglings. These are disabled by replacing their leading backslash with an escape sequence (\-). To mangle a non-empty string, three logical steps are required, though they may be carried out in one pass:

  1. In each accidental escape, replace the backslash with an escape sequence (\-).
  2. Replace each dangerous character with an escape sequence (\| for /, etc.).
  3. If the first two steps introduced any change, and if the string does not already begin with a backslash, prepend a null prefix (\=).
To mangle the empty string, prepend a null prefix.
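A minimal one-pass sketch of the mangling steps above, assuming the escape table from the Quick Start section. The class name Mangler and its method names are made up for illustration; this is not the code from StringNames.java or the JDK.

```java
// Hypothetical sketch of the mangling rules; names are illustrative only.
final class Mangler {
    static final char ESC = '\\';
    // Position i of DANGEROUS maps to position i of REPLACEMENT.
    static final String DANGEROUS   = "/.;$<>[]:\\";
    static final String REPLACEMENT = "|,?%^_{}!-";

    /** True if s, read as a spelling, has an accidental escape at index i. */
    static boolean isEscapeAt(String s, int i) {
        if (s.charAt(i) != ESC || i + 1 >= s.length()) return false;
        char next = s.charAt(i + 1);
        return REPLACEMENT.indexOf(next) >= 0 || (i == 0 && next == '=');
    }

    static String mangle(String spelling) {
        if (spelling.isEmpty()) return "\\=";          // empty string: null prefix only
        StringBuilder sb = new StringBuilder(spelling.length() + 2);
        for (int i = 0; i < spelling.length(); i++) {
            char c = spelling.charAt(i);
            int d = DANGEROUS.indexOf(c);
            if (c == ESC) {
                // Step 1: only backslashes that start an accidental escape change.
                sb.append(isEscapeAt(spelling, i) ? "\\-" : "\\");
            } else if (d >= 0) {
                // Step 2: dangerous character becomes backslash + replacement.
                sb.append(ESC).append(REPLACEMENT.charAt(d));
            } else {
                sb.append(c);
            }
        }
        // Step 3: every replacement adds one character, so a length
        // difference means steps 1 and 2 changed something.
        boolean changed = sb.length() != spelling.length();
        if (changed && sb.charAt(0) != ESC) sb.insert(0, "\\=");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(mangle("<pre>"));   // \^pre\_
        System.out.println(mangle("phase.1")); // \=phase\,1
        System.out.println(mangle("+"));       // +
        System.out.println(mangle(""));        // \=
    }
}
```

Accidental escapes are detected on the original spelling, before any replacement. That is safe because every sequence introduced in step 2 starts with a fresh backslash, so an original backslash followed by a dangerous character ends up followed by another backslash, which never forms an escape.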

To demangle a mangled string that begins with an escape, remove any null prefix, and then replace (in parallel) each escape sequence by its original character.
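Demangling can be sketched in the same style. The sketch assumes that "begins with an escape" means "begins with a backslash", and that any other string is returned unchanged. The class name Demangler is hypothetical.

```java
// Hypothetical sketch of the demangling rule; names are illustrative only.
final class Demangler {
    static final char ESC = '\\';
    // Position i of REPLACEMENT maps back to position i of ORIGINAL.
    static final String REPLACEMENT = "|,?%^_{}!-";
    static final String ORIGINAL    = "/.;$<>[]:\\";

    static String demangle(String name) {
        // Names that do not start with a backslash demangle to themselves.
        if (name.isEmpty() || name.charAt(0) != ESC) return name;
        int start = name.startsWith("\\=") ? 2 : 0;    // drop a null prefix
        StringBuilder sb = new StringBuilder(name.length());
        for (int i = start; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c == ESC && i + 1 < name.length()) {
                int r = REPLACEMENT.indexOf(name.charAt(i + 1));
                if (r >= 0) {
                    sb.append(ORIGINAL.charAt(r));     // escape -> original character
                    i++;                               // consume the replacement character
                    continue;
                }
            }
            sb.append(c);                              // ordinary character or lone backslash
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(demangle("\\^pre\\_"));    // <pre>
        System.out.println(demangle("\\=phase\\,1")); // phase.1
        System.out.println(demangle("plain"));        // plain
    }
}
```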

Spelling strings which contain accidental escapes must have them replaced, even if those strings do not contain dangerous characters. This restriction means that mangling a string always requires a scan of the string for escapes. But then, a scan would be required anyway, to check for dangerous characters.

Nice Properties

If a bytecode name does not contain any escape sequence, demangling is a no-op: The string demangles to itself. Such a string is called self-mangling. Almost all strings are self-mangling. In practice, to demangle almost any name “found in nature”, simply verify that it does not begin with a backslash.

Mangling is an invertible function, while demangling is not. A mangled string is defined as validly mangled if it is in fact the unique mangling of its spelling string. Three examples of invalidly mangled strings are \=foo, \-bar, and baz\!, which demangle to foo, \bar, and baz\! (the last is unchanged, since it does not begin with a backslash), but then remangle to foo, \bar, and \=baz\-!. If a language back-end or runtime is using mangled names, it should never present an invalidly mangled bytecode name to the JVM. If the runtime encounters one, it should also report an error, since such an occurrence probably indicates a bug in name encoding which will lead to errors in linkage. However, this note does not propose that the JVM verifier detect invalidly mangled names.
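One way a runtime might detect invalidly mangled names is a round trip: demangle, remangle, and compare. The sketch below is an assumption about how such a check could look, not a description of any existing API. The class ValidityCheck is hypothetical, and compact versions of mangle and demangle are included only so the example stands alone.

```java
// Hypothetical round-trip validity check; all names are illustrative.
final class ValidityCheck {
    static final char ESC = '\\';
    static final String DANGEROUS   = "/.;$<>[]:\\";
    static final String REPLACEMENT = "|,?%^_{}!-";

    static boolean isEscapeAt(String s, int i) {
        if (s.charAt(i) != ESC || i + 1 >= s.length()) return false;
        char next = s.charAt(i + 1);
        return REPLACEMENT.indexOf(next) >= 0 || (i == 0 && next == '=');
    }

    static String mangle(String s) {
        if (s.isEmpty()) return "\\=";
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int d = DANGEROUS.indexOf(c);
            if (c == ESC) sb.append(isEscapeAt(s, i) ? "\\-" : "\\");
            else if (d >= 0) sb.append(ESC).append(REPLACEMENT.charAt(d));
            else sb.append(c);
        }
        if (sb.length() != s.length() && sb.charAt(0) != ESC) sb.insert(0, "\\=");
        return sb.toString();
    }

    static String demangle(String s) {
        if (s.isEmpty() || s.charAt(0) != ESC) return s;
        StringBuilder sb = new StringBuilder();
        for (int i = s.startsWith("\\=") ? 2 : 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int r = (c == ESC && i + 1 < s.length())
                    ? REPLACEMENT.indexOf(s.charAt(i + 1)) : -1;
            if (r >= 0) { sb.append(DANGEROUS.charAt(r)); i++; }
            else sb.append(c);
        }
        return sb.toString();
    }

    /** A name is validly mangled iff it is the mangling of its own demangling. */
    static boolean isValidlyMangled(String name) {
        return mangle(demangle(name)).equals(name);
    }

    public static void main(String[] args) {
        System.out.println(isValidlyMangled("\\^pre\\_")); // true
        System.out.println(isValidlyMangled("\\=foo"));    // false
        System.out.println(isValidlyMangled("\\-bar"));    // false
        System.out.println(isValidlyMangled("baz\\!"));    // false
    }
}
```

Under this definition a name that contains a raw dangerous character, such as phase.1, is also reported as not validly mangled, because its spelling would mangle to a different string.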

As a result of these rules, it is a simple matter to compute validly mangled substrings and concatenations of validly mangled strings, and (with a little care) these correspond to corresponding operations on their spelling strings.

  • Any prefix of a validly mangled string is also validly mangled, although a null prefix may need to be removed.
  • Any suffix of a validly mangled string is also validly mangled, although a null prefix may need to be added.
  • Two validly mangled strings, when concatenated, are also validly mangled, although any null prefix must be removed from the second string, and a trailing backslash on the first string may need escaping, if it would participate in an accidental escape when followed by the first character of the second string. A null prefix may need to be added to the result.

If languages that include non-Java symbol spellings use this mangling convention, they will enjoy the following advantages:

  • They can interoperate via symbols they share in common.
  • Low-level tools, such as backtrace printers, will have readable displays.
  • Future JVM and language extensions can safely use the dangerous characters for structuring symbols, but will never interfere with valid spellings.
  • Runtimes and compilers can use standard libraries for mangling and demangling.
  • Occasional transliterations and name composition will be simple and regular, for classes, methods, and fields.
  • Bytecode names will continue to be compact. When mangled, spellings will at most double in length, either in UTF8 or UTF16 format, and most will not change at all.

Suggestions for Human Readable Presentations

For human readable displays of symbols, it will be better to present a string-like quoted representation of the spelling, because JVM users are generally familiar with such tokens. We suggest using single or double quotes before and after symbols which are not valid Java identifiers, with quotes, backslashes, and non-printing characters escaped as if for literals in the Java language.

For example, an HTML-like spelling <pre> mangles to \^pre\_ and could display more cleanly as '<pre>', with the quotes included. Such string-like conventions are not suitable for mangled bytecode names, in part because dangerous characters must be eliminated, rather than just quoted. Otherwise internally structured strings like package prefixes and method signatures could not be reliably parsed.

In such human-readable displays, invalidly mangled names should not be demangled and quoted, for this would be misleading. Likewise, JVM symbols which contain dangerous characters (like dots in field names or brackets in method names) should not be simply quoted. The bytecode names \=phase\,1 and phase.1 are distinct, and in demangled displays they should be presented as 'phase.1' and something like 'phase'.1, respectively.
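The quoting suggestion for the simple case (a spelling that has already been validly demangled) might look like the sketch below. The class DisplayNames and its methods are invented for illustration. The identifier test is an approximation that ignores Java keywords, and the sketch does not attempt the mixed form 'phase'.1 for names with raw dangerous characters.

```java
// Hypothetical display helper; names and quoting details are assumptions.
final class DisplayNames {
    /** Approximate test for a Java identifier (keywords are not excluded). */
    static boolean isJavaIdentifier(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) return false;
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) return false;
        }
        return true;
    }

    /** Shows a spelling as-is if it is an identifier, else single-quoted. */
    static String display(String spelling) {
        if (isJavaIdentifier(spelling)) return spelling;
        StringBuilder sb = new StringBuilder("'");
        for (int i = 0; i < spelling.length(); i++) {
            char c = spelling.charAt(i);
            if (c == '\'' || c == '\\') {
                sb.append('\\').append(c);                     // escape quote and backslash
            } else if (Character.isISOControl(c)) {
                sb.append(String.format("\\u%04x", (int) c));  // non-printing character
            } else {
                sb.append(c);
            }
        }
        return sb.append('\'').toString();
    }

    public static void main(String[] args) {
        System.out.println(display("<pre>"));   // '<pre>'
        System.out.println(display("phase.1")); // 'phase.1'
        System.out.println(display("foo"));     // foo
    }
}
```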

Fine Print

These rules build upon the JVM specification, as modified by JSR 202.   The relevant language goes something like this:

4.3.2   Unqualified Names
Names of methods, fields and local variables are stored as unqualified names.  Unqualified names must not contain the characters '.', ';', '[' or '/'. Method names are further constrained so that, with the exception of the special method names <init> and <clinit> (§3.9), they must not contain the characters '<' or '>'.

JVMs have applied these relaxed identifier rules since Java 5.

The JVM requires that the bytecode name of a method not contain angle brackets, but it allows them in fields and type names. Actually, there is a problem with putting left angle brackets in type names, since left angle bracket is also a delimiter character in generic signature encodings, such as LFoo<T>;. (Does the preceding mean an unparameterized type spelled Foo<T> or is it an instance of the generic spelled Foo?) Thus, left angle bracket is a character that the JVM specification does not realize is dangerous! A future version of the JVM spec. could amend this, and simplify the rules, by declaring angle brackets illegal everywhere (except for certain method names). It should also make left square bracket legal in type names, except in the first character position of a qualified type name (where it may denote an array type).

There are plenty of other characters which look dangerous, but are innocuous to the JVM. For example, in settings where method names are concatenated with method signatures, it might seem that parentheses pose a danger to correctly parsing them apart again. (Try this concatenation: (I)(D)(L)(J)(;)V. Can you tell which is method name and which is signature?) Such concerns about parsing apply (in various settings) to spaces, newlines, the null character, etc. For this specification, we do not propose to predict all such parsing risks, and instead focus on exactly those characters which the JVM itself disallows. The concept of display names introduced above can help with applications which must produce parseable text that include mangled names. (The example above could be displayed in a more parser-friendly form as '(I)(D)'(L')(J)(';)V.) In the case of method signatures, note that method (and field) names and signatures are always presented in the class file as separate CONSTANT_Utf8 references, so there is no need (inherent in the JVM) to concatenate or parse them.

Bytecode compilers are encouraged to use this mangling whenever they represent a name directly to the JVM (as a so-called “bytecode name”).  In the very few places (if any) where the JVM or reflective APIs undertake to transform names from the bytecode level to language-level strings, they should remove the mangling also.

The escape character is not a universal “superquote”. There are only a few escape sequences, and any other occurrences of the backslash (reverse solidus) character are not treated specially. (They do not need further quoting to avoid treating them as accidental escapes.) In this, the escape character works like it does within Unix shell quoted strings, where (for example) "\x" is a string with two characters. 

It is actually useful to define a few extra dangerous characters. In many languages, symbols have attributes beyond their basic spelling. (For example, Fortress symbols have a font attribute, and Common Lisp symbols have a package prefix.) Characters like square brackets and backquote which are allowed by the JVM but dangerous in the present sense are opportunities for adding more structure to the bytecode names of symbols. Languages are therefore free to use the colon character and (in non-class names) square brackets to add structure to their symbols. It may be best if the basic spelling of the symbol comes at a fixed position, so that low-level tools have a reasonable chance of demangling at least that part of the symbol. However, this is a matter for another layer of software to decide, specifically a metaobject protocol which is concerned with uniform naming and access of program elements. A name which consists of a sequence of colons and mangled names may be called a compound name. It is natural to use compound names with the invokedynamic instruction, to convey necessary information about a call site to the metaobject protocol. Names like x:get and x:set are appealing for building clean property APIs.

The slash, dot, and semicolon characters have a more central role in the syntax of bytecode names and signatures, so languages should not use them. Future extensions of the JVM could appropriate these characters if the JVM itself needed to add structure to bytecode names. For example, the JVM specification allows dot . inside field names, but since manglings avoid that dangerous character in field names, then dot can be used, without conflict, as a delimiter for encoding tuple element references.

Remaining Issues

In order to test these ideas, I have coded them in Java and written a small test harness called StringNames.java. You are welcome to try it out. There is also mangling code in the OpenJDK, as part of JSR 292. In the Da Vinci Machine patch repository there are unit tests.

One bit of work has not been attempted here. We do not consider manglings over the much more restricted set of characters allowed by earlier versions of the JVM. Because these allowed only valid Java identifiers, some sort of much more complex and disruptive scheme would be required to encode free spellings as Java identifiers. It would probably use the dollar sign $ as an escape character, and encode non-alphabetics as sequences of alphabetics in a high-base numbering system. Collisions with pre-existing uses of the escape character would complicate matters.

Another bit of work saved for later is removal of length restrictions. JVM bytecode names (and signatures that contain them) must be representable in fewer than 65536 bytes of (modified null-free) UTF8. I guess freedom is more of a journey than a destination...

Change Log and Acknowledgements

March 2008: Per Bothner pointed out that the dollar sign is in fact dangerous, since various bits of code (including Class.java) look for it as a special delimiter in bytecode names. So now we replace it by an escape sequence with a percent sign. (I guess I was too close to that problem to see it!)

Note that all this stuff works in Java 5 and later.

September 2008: The Da Vinci Machine project has a patch to javac which lets you pass exotic names through the javac frontend. It does not mandate this or any other mangling scheme.

February 2009: Improved the language a little, and put in cross-references to invokedynamic and compound names.

August 2009: Tweaked the mangling and concatenation rules. (Hat tip to David Chase.)

September 2012: Fixed web page damage around backslashes, updated a couple pathnames. (Hat tip to John Cowan.)

Comments:

During the conversion process, all the backslashes got doubled, thus unintentionally exhibiting an example of what's wrong with C-style backslash escapes. Please fix, as it's very confusing!

Posted by guest on August 17, 2012 at 08:56 PM PDT #

About

John R. Rose

Java maven, HotSpot developer, Mac user, Scheme refugee.

Once Sun and present Oracle engineer.
