Project Coin: Anatomy of adding strings in switch to javac

Earlier this week, I happily pushed an implementation of Project Coin's strings in switch language feature into a repository being used for JDK 7 milestone 5. JDK 7 binaries with strings in switch will be available for downloading in due course.

The javac compiler uses the standard compiler architecture of having successive phases and adding strings in switch required modifications to several links in the chain of phases.

Since JDK 1.4.x, javac has been multiple compilers in one since it has supported multiple source settings to specify what version of the language to use. (But when cross-compiling to an older language version, make sure to set the bootclasspath!) Information about which features are supported in a given version is encapsulated in javac's class com.sun.tools.javac.code.Source. When other parts of the compiler need to query if a feature is supported, they call a boolean method like Source.allowFeature(). Strings in switch added another method to this class; in diff notation:

+ public boolean allowStringsInSwitch() {
+ return compareTo(JDK1_7) >= 0;
+ }

The text of a source file being compiled is first turned into an abstract syntax tree (AST). The lexing and parsing processes which create the trees only verify the syntactic structure of the code. Semantic checks are done in subsequent phases of the compiler; many of those type checks are done during "attribution" by code in com.sun.tools.javac.comp.Attr. For switch statements, the constant-ness of case labels, having no repeated labels, and the labels' types matching the type of the expression being switched on are all handled in Attr. Attr also enforces the type restrictions on the switch expression; the expression can have an integer type, an enum type (as of JDK 5), and now a string type (as of JDK 7).

Because the trees in javac were sufficiently abstract, modifying Attr to support strings in switch was very straightforward. Most of the code was just for adding the allowStringsInSwitch check and generating an error message specific for a non-constant string. As shown in the patch below, no code implementing the actual semantic checks for non-repeated labels, etc. had to be adjusted to support strings in switch; all the existing checking logic was reused without modification.

--- a/src/share/classes/com/sun/tools/javac/comp/Attr.java	Thu Aug 27 13:40:48 2009 +0100
+++ b/src/share/classes/com/sun/tools/javac/comp/Attr.java	Mon Nov 02 21:36:59 2009 -0800
@@ -115,6 +115,8 @@ public class Attr extends JCTree.Visitor
         allowBoxing = source.allowBoxing();
         allowCovariantReturns = source.allowCovariantReturns();
         allowAnonOuterThis = source.allowAnonOuterThis();
+        allowStringsInSwitch = source.allowStringsInSwitch();
+        sourceName = source.name;
         relax = (options.get("-retrofit") != null ||
                  options.get("-relax") != null);
         useBeforeDeclarationWarning = options.get("useBeforeDeclarationWarning") != null;
@@ -166,6 +168,16 @@ public class Attr extends JCTree.Visitor
      \* API warnings.
      \*/
     boolean enableSunApiLintControl;
+
+    /\*\*
+     \* Switch: allow strings in switch?
+     \*/
+    boolean allowStringsInSwitch;
+
+    /\*\*
+     \* Switch: name of source level; used for error reporting.
+     \*/
+    String sourceName;
 
     /\*\* Check kind and type of given tree against protokind and prototype.
      \*  If check succeeds, store type in tree and return it.
@@ -886,7 +898,15 @@ public class Attr extends JCTree.Visitor
         boolean enumSwitch =
             allowEnums &&
             (seltype.tsym.flags() & Flags.ENUM) != 0;
-        if (!enumSwitch)
+        boolean stringSwitch = false;
+        if (types.isSameType(seltype, syms.stringType)) {
+            if (allowStringsInSwitch) {
+                stringSwitch = true;
+            } else {
+                log.error(tree.selector.pos(), "string.switch.not.supported.in.source", sourceName);
+            }
+        }
+        if (!enumSwitch && !stringSwitch)
             seltype = chk.checkType(tree.selector.pos(), seltype, syms.intType);
 
         // Attribute all cases and
@@ -909,7 +929,8 @@ public class Attr extends JCTree.Visitor
                     Type pattype = attribExpr(c.pat, switchEnv, seltype);
                     if (pattype.tag != ERROR) {
                         if (pattype.constValue() == null) {
-                            log.error(c.pat.pos(), "const.expr.req");
+                            log.error(c.pat.pos(),
+                                      (stringSwitch ? "string.const.req" : "const.expr.req"));
                         } else if (labels.contains(pattype.constValue())) {
                             log.error(c.pos(), "duplicate.case.label");
                         } else {

Two new error messages were added for strings in switch. The first reports that a constant string expression is required rather than just a constant expression. The second reports that a string switch statement has been found in a source level where that feature is not allowed. The javac convention for this kind of error message is to report the source level being used in the current compile and the source level where the structure starts being supported:

+compiler.err.string.const.req=\\
+    constant string expression required

+compiler.err.string.switch.not.supported.in.source=\\
+    strings in switch are not supported in -source {0}\\n\\
+(use -source 7 or higher to enable strings in switch)

The bulk of the compiler changes to support strings in switch were made in com.sun.tools.javac.comp.Lower, the compiler phase which translates away syntactic sugar, lowering structures to trees implementing the formerly sugar-coated functionality. For example, Lower already had a method to translate enum switches into a switch on an integer value retrieved from an enum → int map. The initial strings in switch implementation uses a similar technique: a single string switch in source code is lowered into a series of two switches. The first switch is a new synthesized switch on the string's hash code, which gets mapped to a label's ordinal position on the list of case statements. The second switch is structurally identical to the original string switch from the source except that the string case labels are replaced by integer positions and the computed position from the synthesized switch is the expression being switched on.

Discussion of various design issues and implementation alternatives can be found in the full code changes in Lower; see the webrev of the changeset for details. The chained series of switch statements to implement strings is switch is a refinement of the implementation approaches outlined in the original Project Coin strings in switch proposal.

Various smaller implement points about the code in Lower are worth noting too.

A pair of maps are built to hold information about translating string case labels → positions and hash codes → case labels with that hash. In order to make the compiler's output predictable and repeatable, the maps and sets used in these data structures are LinkedHashMaps and LinkedHashSets rather than just HashMaps and HashSets. In terms of functional correctness of code generated during a given compile, using HashMap and HashSet would be fine; the iteration order does not matter. However, we find it beneficial to have javac's output not vary based on implementation details of system classes.

The code synthesis in Lower uses a compiler-internal tree maker API. The resulting trees are unprocessed; the various consistency properties and derived information calculated by earlier compiler phases are not directly enforced. The code in Lower is responsible for generating correct code and computing necessary derived information; if this is not done correctly, later phases like Gen, which translates trees into byte code, can fail. For example, trees have an associated source position; this is preserved in debugging information. The position for the initial synthesized switch is set to the source position of the start of the original switch statement. The tree node for a break statement stores information about the target of the break, which is another tree node. Therefore, break statements synthesized as part of the initial switch need this target information to be set and any break nodes in from original switch statement need to have their targets reset to the newly generated enclosing switch where integer labels have replaced the string ones. This target patching logic gets thoroughly exercised by the tests with nested strings in switch statements!

The new tests verify that the proper structural checks are enforced for string switches as well as verify that the proper execution paths are taken on different inputs for switches with a variety of control flow shapes, including multiple case labels, case labels with colliding hash codes, and nested switches.

Now that strings in switch are specified, implemented, and tested, I'm looking forward to working on the remaining Project Coin features on the final acceptance list.

Comments:

Thanks for that!

I see I didn't lure you into generalising to CharSequence. %\^P

Anyhow, I look forward to using this: I already use the switch-on-enum from time-to-time and it's good to have.

Rgds

Damon

Posted by Damon Hart-Davis on November 04, 2009 at 05:23 PM PST #

This is cool.

But it just occurred to me that we could have pattern matching in Java:

switch(string) {
case /a\*b/:
// matches using a Regex
...
}

it wouldn't be much harder than a string switch

Posted by am on November 05, 2009 at 12:30 AM PST #

@Damon, this kind of generalization is not possible. String switch relies on the fact that hashCode implementation of String is fully specified.
So hashCodes can be pre-computed by the compiler with the garantee that for a String the hashCode() computed at runtime will be the same that the one computed at compile time.

@am, It will be not that hard but definitively a different implementation.
Here the trick is to take all cases, to put them between parenthesis and to separate them by a
or (|). With that you can use capture groups to find the regular expression that has matched the string. It's a kind of straw man lexer.

Rémi

Posted by Rémi Forax on November 09, 2009 at 05:46 AM PST #

Post a Comment:
Comments are closed for this entry.
About

darcy

Search

Archives
« July 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
  
       
Today
News

No bookmarks in folder

Blogroll