    January 23, 2008

symbolic freedom in the VM

John Rose
Architect
Or, watching your language with dangerous characters.

Introduction

The JVM uses symbolic names to link together the many hundreds of
classes that make up an application.  Like source code in the Java
programming language, these symbols fall into a small number of
categories:  Class, package, method, field, type signature.
 Unlike source code, symbols in the JVM are represented in a
uniform syntax, a counted sequence of Unicode characters.  Since
this same format is used in class files to represent String literals,
and since Strings are arbitrary character sequences, perhaps the JVM
can readily accept class, package, method, field, and type names which
could be any string, not just the strings accepted by the Java
compiler.  Let's call such names “exotic names”.

The JVM originally inherited symbol spelling restrictions from the Java
language, but in recent years it has removed most restrictions.
This note describes how to remove the remaining restrictions,
by presenting a universal mangling convention to encode
arbitrary spelling strings into a form which is permitted by the JVM.
This mangling is easy for humans to read and for machines to decode.

The motivation comes from non-Java languages, which have their own rules for
composing the names of types, variables, and so on.  Some languages,
like Common Lisp, allow any string whatever (even the empty string!) to
spell a symbol name.  Languages with operator redefinition
absolutely require a way to process symbols like “+” (a single plus
sign).  When such a language meets the JVM, language-specific
names must either be kept completely separate from JVM names, or the
language’s bytecode compiler must use some sort of name mangling to
keep the JVM from panicking.

Quick Start

For those who prefer to guess rationale from bald facts, here
are the encoding rules in tabular form:

Dangerous Character   Why Dangerous                                            Where Illegal   Escape Sequence
/  002F               delimits a package prefix in a class name                any name        \|  005C 007C
.  002E               looks like a package prefix                              any name        \,  005C 002C
;  003B               delimits a type within a field or method signature       any name        \?  005C 003F
$  0024               looks like a nested class name or synthetic member       nowhere         \%  005C 0025
<  003C               looks like <init>, delimiter in generic type signature   method name     \^  005C 005E
>  003E               looks like <init>                                        method name     \_  005C 005F
[  005B               begins the name of an array class                        class name      \{  005C 007B
]  005D               not dangerous, but goes with [; reserved                 nowhere         \}  005C 007D
:  003A               not dangerous, but reserved for language use             nowhere         \!  005C 0021
\  005C               not dangerous, except when forming an accidental escape  nowhere         \-  005C 002D
(null string)         bytecode names must be non-empty                         any name        \=  005C 003D


Avoiding Dangerous Characters

The JVM defines a very small set of characters which are illegal
in name spellings. We will slightly extend and regularize this set
into a group of dangerous characters.
These characters will then be replaced, in mangled names, by escape sequences.
In addition, accidental escape sequences must be further escaped.
Finally, a special prefix will be applied if and only if
the mangling would otherwise fail to begin with the escape character.
This happens to cover the corner case of the null string,
and also clearly marks symbols which need demangling.

Dangerous characters are the union of all characters forbidden
or otherwise restricted by the JVM specification,
plus their mates, if they are brackets
([ and ],
< and >),
plus, arbitrarily, the colon character :.
There is no distinction between type, method, and field names.
This makes it easier to convert between mangled names of different
kinds, since they do not need to be decoded (demangled) first.

The escape character is backslash \
(also known as reverse solidus).
This character is, until now, unheard of in bytecode names,
but traditional in the proposed role.

Replacement Characters

Every escape sequence is two characters
(in fact, two UTF8 bytes) beginning with
the escape character and followed by a
replacement character.
(Since the replacement character is never a backslash,
iterated manglings do not double in size.)

Each dangerous character has some rough visual similarity
to its corresponding replacement character.
This makes mangled symbols easier to recognize by sight.

The dangerous characters are
/ (forward slash, used to delimit package components),
. (dot, also a package delimiter),
; (semicolon, used in signatures),
$ (dollar, used in inner classes and synthetic members),
< (left angle),
> (right angle),
[ (left square bracket, used in array types),
] (right square bracket, reserved in this scheme for language use),
and : (colon, reserved in this scheme for language use).
Their replacements are, respectively,
| (vertical bar),
, (comma),
? (question mark),
% (percent),
^ (caret),
_ (underscore),
{ (left curly bracket),
} (right curly bracket), and
! (exclamation mark).
In addition, the replacement character for the escape character itself is
- (hyphen),
and the replacement character for the null prefix is
= (equal sign).

An escape character \
followed by any of these replacement characters
is an escape sequence, and there are no other escape sequences.
An equal sign is only part of an escape sequence
if it is the second character in the whole string, following a backslash.
Two consecutive backslashes do not form an escape sequence.

Each escape sequence replaces a so-called original character
which is either one of the dangerous characters or the escape character.
A null prefix replaces an initial null string, not a character.

All this implies that escape sequences cannot overlap and may be
determined all at once for a whole string. Note that a spelling
string can contain accidental escapes, apparent escape
sequences which must not be interpreted as manglings.
These are disabled by replacing their leading backslash with an
escape sequence (\-). To mangle a non-empty string, three logical steps
are required, though they may be carried out in one pass:


  1. In each accidental escape, replace the backslash with an escape sequence
    (\-).
  2. Replace each dangerous character with an escape sequence
    (\| for /, etc.).
  3. If the first two steps introduced any change, and
    if the string does not already begin with a backslash, prepend a null prefix (\=).


To mangle the empty string, prepend a null prefix.
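
The three steps above can be sketched in Java as a single pass. This is only a minimal illustration of the rules as stated in this note, not the StringNames.java harness or the OpenJDK code; the class name MangleSketch and its two parallel tables are assumptions made for the example.

```java
final class MangleSketch {
    // Parallel tables: '\\' + REPLACEMENT.charAt(i) is the escape for DANGEROUS.charAt(i).
    // Index 0 pairs the escape character itself with '-'.
    private static final String DANGEROUS   = "\\/.;$<>[]:";
    private static final String REPLACEMENT = "-|,?%^_{}!";

    static String mangle(String s) {
        if (s.isEmpty()) return "\\=";   // the null prefix alone encodes the empty string
        StringBuilder sb = new StringBuilder(s.length() + 2);
        boolean changed = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean escape;
            if (c == '\\') {
                // Step 1: only a backslash that starts an accidental escape is rewritten.
                escape = i + 1 < s.length() && isReplacement(s.charAt(i + 1), i == 0);
            } else {
                // Step 2: every dangerous character is rewritten.
                escape = DANGEROUS.indexOf(c) >= 0;
            }
            if (escape) {
                sb.append('\\').append(REPLACEMENT.charAt(DANGEROUS.indexOf(c)));
                changed = true;
            } else {
                sb.append(c);
            }
        }
        // Step 3: mark a changed name that does not already begin with a backslash.
        if (changed && sb.charAt(0) != '\\') sb.insert(0, "\\=");
        return sb.toString();
    }

    // '=' is a replacement character only directly after a backslash at index 0.
    private static boolean isReplacement(char c, boolean afterLeadingBackslash) {
        return REPLACEMENT.indexOf(c) >= 0 || (afterLeadingBackslash && c == '=');
    }

    public static void main(String[] args) {
        System.out.println(mangle("<pre>"));     // \^pre\_
        System.out.println(mangle("phase.1"));   // \=phase\,1
        System.out.println(mangle("+"));         // +
    }
}
```

Note that a spelling such as + passes through unchanged, which is the common case.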

To demangle a mangled string that begins with an escape,
remove any null prefix, and then replace (in parallel)
each escape sequence by its original character.
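
Demangling can be sketched the same way. The code below is an illustration under the rules of this note only; the class name DemangleSketch and the table layout are assumptions, and no attempt is made to reject invalidly mangled input.

```java
final class DemangleSketch {
    // Same parallel tables as the escape table: original character and its replacement.
    private static final String DANGEROUS   = "\\/.;$<>[]:";
    private static final String REPLACEMENT = "-|,?%^_{}!";

    static String demangle(String s) {
        // A name that does not begin with a backslash demangles to itself.
        if (s.isEmpty() || s.charAt(0) != '\\') return s;
        StringBuilder sb = new StringBuilder(s.length());
        int i = s.startsWith("\\=") ? 2 : 0;   // remove a null prefix, if present
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                int k = REPLACEMENT.indexOf(s.charAt(i + 1));
                if (k >= 0) {                  // a real escape sequence
                    sb.append(DANGEROUS.charAt(k));
                    i += 2;
                    continue;
                }
            }
            sb.append(c);                      // ordinary character, or a stray backslash
            i++;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(demangle("\\^pre\\_"));     // <pre>
        System.out.println(demangle("\\=phase\\,1"));  // phase.1
        System.out.println(demangle("foo"));           // foo
    }
}
```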

Spelling strings which contain accidental
escapes must have them replaced, even if those
strings do not contain dangerous characters.
This restriction means that mangling a string always
requires a scan of the string for escapes.
But then, a scan would be required anyway,
to check for dangerous characters.

Nice Properties

If a bytecode name does not contain any escape sequence,
demangling is a no-op: The string demangles to itself.
Such a string is called self-mangling.
Almost all strings are self-mangling.
In practice, to demangle almost any name “found in nature”,
simply verify that it does not begin with a backslash.

Mangling is an invertible function, while demangling
is not.
A mangled string is defined as validly mangled if
it is in fact the unique mangling of its spelling string.
Three examples of invalidly mangled strings are \=foo,
\-bar, and baz\!, which demangle to foo, \bar, and
baz\!, but then remangle to foo, \bar, and \=baz\-!.
If a language back-end or runtime is using mangled names,
it should never present an invalidly mangled bytecode
name to the JVM. If the runtime encounters one,
it should also report an error, since such an occurrence
probably indicates a bug in name encoding which
will lead to errors in linkage.
However, this note does not propose that the JVM verifier
detect invalidly mangled names.
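
A runtime that wants to report invalidly mangled names could apply the definition directly: demangle, remangle, and compare. The sketch below assumes the rules of this note; the class ValidMangleSketch and the method isValidlyMangled are hypothetical names, and the two helpers are compact illustrations rather than any shipped implementation.

```java
final class ValidMangleSketch {
    private static final String DANGEROUS   = "\\/.;$<>[]:";
    private static final String REPLACEMENT = "-|,?%^_{}!";

    /** True if s is exactly the mangling of its own demangled spelling. */
    static boolean isValidlyMangled(String s) {
        return mangle(demangle(s)).equals(s);
    }

    static String mangle(String s) {
        if (s.isEmpty()) return "\\=";
        StringBuilder sb = new StringBuilder();
        boolean changed = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean escape = DANGEROUS.indexOf(c) >= 0;
            if (c == '\\') {
                // A backslash is escaped only when it begins an accidental escape.
                char next = i + 1 < s.length() ? s.charAt(i + 1) : 'x';
                escape = REPLACEMENT.indexOf(next) >= 0 || (i == 0 && next == '=');
            }
            if (escape) {
                sb.append('\\').append(REPLACEMENT.charAt(DANGEROUS.indexOf(c)));
                changed = true;
            } else {
                sb.append(c);
            }
        }
        if (changed && sb.charAt(0) != '\\') sb.insert(0, "\\=");
        return sb.toString();
    }

    static String demangle(String s) {
        if (s.isEmpty() || s.charAt(0) != '\\') return s;
        StringBuilder sb = new StringBuilder();
        for (int i = s.startsWith("\\=") ? 2 : 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int k = (c == '\\' && i + 1 < s.length()) ? REPLACEMENT.indexOf(s.charAt(i + 1)) : -1;
            if (k >= 0) { sb.append(DANGEROUS.charAt(k)); i++; }
            else sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(isValidlyMangled("\\^pre\\_"));  // true
        System.out.println(isValidlyMangled("\\=foo"));     // false
        System.out.println(isValidlyMangled("baz\\!"));     // false
    }
}
```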

As a result of these rules, it is a simple matter to
compute validly mangled substrings and concatenations
of validly mangled strings, and (with a little care)
these correspond to the same operations on their
spelling strings.

  • Any prefix of a validly mangled string is also validly mangled,
    although a null prefix may need to be removed.
  • Any suffix of a validly mangled string is also validly mangled,
    although a null prefix may need to be added.
  • Two validly mangled strings, when concatenated,
    are also validly mangled, although any null prefix
    must be removed from the second string,
    and a trailing backslash on the first string may need escaping,
    if it would participate in an accidental escape when followed
    by the first character of the second string.
    A null prefix may need to be added to the result.

If languages that include non-Java symbol spellings use this
mangling convention, they will enjoy the following advantages:

  • They can interoperate via symbols they share in common.
  • Low-level tools, such as backtrace printers, will have readable displays.
  • Future JVM and language extensions can safely use the dangerous characters
    for structuring symbols, but will never interfere with valid spellings.
  • Runtimes and compilers can use standard libraries for mangling and demangling.
  • Occasional transliterations and name composition will be simple and regular,
    for classes, methods, and fields.
  • Bytecode names will continue to be compact.
    When mangled, spellings will at most double in length, either in
    UTF8 or UTF16 format, and most will not change at all.

Suggestions for Human Readable Presentations

For human readable displays of symbols,
it will be better to present a string-like quoted
representation of the spelling, because JVM users
are generally familiar with such tokens.
We suggest using single or double quotes before and after
symbols which are not valid Java identifiers,
with quotes, backslashes, and non-printing characters
escaped as if for literals in the Java language.

For example, an HTML-like spelling
<pre> mangles to
\^pre\_ and could
display more cleanly as
'<pre>',
with the quotes included.
Such string-like conventions are not suitable
for mangled bytecode names, in part because
dangerous characters must be eliminated, rather
than just quoted. Otherwise internally structured
strings like package prefixes and method signatures
could not be reliably parsed.

In such human-readable displays, invalidly mangled
names should not be demangled and quoted,
for this would be misleading. Likewise, JVM symbols
which contain dangerous characters (like dots in field
names or brackets in method names) should not be
simply quoted. The bytecode names
\=phase\,1 and
phase.1 are distinct,
and in demangled displays they should be presented as
'phase.1' and something like
'phase'.1, respectively.
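
As a rough sketch of the quoting suggestion, the following Java takes an already demangled spelling and produces a display form. The class and method names are made up for the example, single quotes are chosen arbitrarily, and the identifier test is approximate (it ignores Java keywords and supplementary characters). It does not handle the second case above, where a bytecode name contains unmangled dangerous characters.

```java
final class DisplayNameSketch {
    /** Returns the spelling unchanged if it looks like a Java identifier, else a quoted form. */
    static String display(String spelling) {
        if (looksLikeJavaIdentifier(spelling)) return spelling;
        StringBuilder sb = new StringBuilder("'");
        for (int i = 0; i < spelling.length(); i++) {
            char c = spelling.charAt(i);
            switch (c) {
                case '\'': sb.append("\\'");  break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (Character.isISOControl(c)) {
                        // Other non-printing characters: four-digit hex escape.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('\'').toString();
    }

    // Approximation: does not exclude keywords such as "class".
    static boolean looksLikeJavaIdentifier(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) return false;
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(display("<pre>"));    // '<pre>'
        System.out.println(display("phase.1"));  // 'phase.1'
        System.out.println(display("foo"));      // foo
    }
}
```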

Fine Print

These rules build upon the JVM specification, as modified by JSR 202.
The relevant language goes something like this:

4.3.2 Unqualified Names

Names of methods, fields and local variables are stored as unqualified
names.  Unqualified names must not contain the characters '.', ';', '['
or '/'. Method names are further constrained so that, with the exception
of the special method names <init> and <clinit> (§3.9), they must not
contain the characters '<' or '>'.



JVMs use these new relaxed identifier rules as of Java 5 and later.

The JVM requires that the bytecode name of a method not contain angle brackets,
but it allows them in field and type names.
Actually, there is a problem with putting left angle brackets in type names, since
left angle bracket is also a delimiter character in generic
signature encodings, such as LFoo<T>;.
(Does the preceding mean an unparameterized type spelled
Foo<T>
or is it an instance of the generic spelled
Foo?)
Thus, left angle bracket is a character that the JVM
specification does not realize is dangerous!
A future version of the JVM spec. could amend this,
and simplify the rules, by declaring angle brackets
illegal everywhere (except for certain method names).
It should also make left square bracket legal in
type names, except in the first character position
of a qualified type name (where it may denote an array type).

There are plenty of other characters which look dangerous, but are innocuous to the JVM.
For example, in settings where method names are concatenated with method signatures,
it might seem that parentheses pose a danger to correctly parsing them apart again.
(Try this concatenation: (I)(D)(L)(J)(;)V.
Can you tell which is method name and which is signature?)
Such concerns about parsing apply (in various settings) to spaces, newlines, the null character, etc.
For this specification, we do not propose to predict all such parsing risks, and instead focus on
exactly those characters which the JVM itself disallows.
The concept of display names introduced above can help with applications which must produce
parseable text that includes mangled names.
(The example above could be displayed in a more parser-friendly form as '(I)(D)'(L')(J)(';)V.)
In the case of method signatures, note that method (and field) names and signatures
are always presented in the class file as separate CONSTANT_Utf8 references,
so there is no need (inherent in the JVM) to concatenate or parse them.

Bytecode compilers are
encouraged to use this mangling whenever they represent a name directly
to the JVM (as a so-called “bytecode name”).  In the very few places (if any) where the JVM or
reflective APIs undertake to transform names from the bytecode
level to language-level strings, they should remove the mangling also.

The escape character is not a universal “superquote”.
There are only a few escape sequences, and any other occurrences
of the backslash (reverse solidus) character are not treated
specially. (They do not need further quoting to avoid treating
them as accidental escapes.) In this, the escape character
works like it does within Unix shell quoted strings, where
(for example) "\x" is a string
with two characters. 

It is actually useful to define a few extra dangerous characters.
In many languages, symbols have attributes beyond their basic spelling.
(For example, Fortress symbols have a font attribute,
and Common Lisp symbols have a package prefix.)
Characters like square brackets and backquote which are allowed
by the JVM but dangerous in the present sense are opportunities
for adding more structure to the bytecode names of symbols.
Languages are therefore free to use the colon character,
and (in non-class names) square brackets, to
add structure to their symbols.
It may be best if the basic spelling of the symbol comes at a fixed position,
so that low-level tools have a reasonable chance of demangling
at least that part of the symbol. However, this is a matter for
another layer of software to decide, specifically a metaobject protocol
which is concerned with uniform naming and access of program elements.
A name which consists of a sequence of colons and
mangled names may be called a compound name.
It is natural to use compound names with the invokedynamic instruction,
to convey necessary information about a call site to the metaobject protocol.
Names like x:get and
x:set are appealing for building clean
property APIs.

The slash, dot, and semicolon characters have a more central role
in the syntax of bytecode names and signatures, so languages
should not use them. Future extensions of the JVM could appropriate
these characters if the JVM itself needed to add structure to
bytecode names.
For example, the JVM specification allows dot . inside field names,
but since manglings avoid that dangerous character in field names,
then dot can be used, without conflict, as a delimiter for encoding
tuple element references.

Remaining Issues

In order to test these ideas, I have coded them in Java and written a small test harness called StringNames.java.
You are welcome to try it out.
There is also mangling code in the OpenJDK, as part of JSR 292. In the Da Vinci Machine patch repository there are unit tests.

One bit of work has not been attempted here. We do not
consider manglings over the much more restricted set
of characters allowed by earlier versions of the JVM.
Because these allowed only valid Java identifiers,
some sort of much more complex and disruptive scheme
would be required to encode free spellings as Java identifiers.
It would probably use the dollar sign $
as an escape character, and encode non-alphabetics
as sequences of alphabetics in a high-base numbering system.
Collisions with pre-existing uses of the escape character
would complicate matters.

Another bit of work saved for later is removal of length restrictions.
JVM bytecode names (and signatures that contain them) must be
representable in less than 65536 bytes of (modified null-free) UTF8.
I guess freedom is more of a journey than a destination...
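
Until then, a compiler can at least check the limit before emitting a name. The sketch below counts bytes the way the class file's modified UTF8 does (the null character takes two bytes, and each UTF16 surrogate is encoded separately in three); the class and method names are assumptions for illustration.

```java
final class NameLengthSketch {
    /** Number of bytes s occupies in the class file's modified UTF8 encoding. */
    static int modifiedUtf8Length(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                n += 1;
            } else if (c <= 0x07FF) {
                n += 2;   // includes the null character, which is never stored as a zero byte
            } else {
                n += 3;   // includes each half of a surrogate pair
            }
        }
        return n;
    }

    /** True if s fits in a single CONSTANT_Utf8 entry, whose length field is 16 bits. */
    static boolean fitsInConstantUtf8(String s) {
        return modifiedUtf8Length(s) < 65536;
    }

    public static void main(String[] args) {
        System.out.println(modifiedUtf8Length("\\^pre\\_"));  // 7
        System.out.println(fitsInConstantUtf8("\\^pre\\_"));  // true
    }
}
```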

Change Log and Acknowledgements

March 2008: Per Bothner pointed out that the dollar sign is in fact dangerous,
since various bits of code (including Class.java)
look for it as a special delimiter in bytecode names.
So now we replace it by an escape sequence with a percent sign.
(I guess I was too close to that problem to see it!)

Note that all this stuff works in Java 5 and later.

September 2008: The Da Vinci Machine project has a patch to javac which lets you pass exotic names through the javac frontend. It does not mandate this or any other mangling scheme.

February 2009: Improved the language a little, and put in cross-references to invokedynamic and compound names.

August 2009: Tweaked the mangling and concatenation rules. (Hat tip to David Chase.)

September 2012: Fixed web page damage around backslashes, updated a couple pathnames. (Hat tip to John Cowan.)
