Project Coin: Taking a Break for Strings in Switch

The initial way a string switch statement was implemented in JDK 7 was to desugar a string switch into a series of two switch statements, the first switch mapping from the argument string's hash code to the ordinal position of a matching string label followed by a second switch mapping from the computed ordinal position to the code to be executed. Before this approach was settled on, Jon, Maurizio, and I had extensive discussions about alternative implementation techniques. One approach from Maurizio we seriously considered using employed labeled break statements (in lieu of unavailable goto statements) to allow a string switch to be desugared into a single integer switch statement. In this approach as well, the basis for the integer switch built around the strings' hash codes.

One kind of complication in desugaring string switch statements stems from irregular control flow, such as when control transfers to one label, code is executed, and then control falls through to the code under the next label rather than exiting the switch statement after the initial code execution. When using hash codes to identify the string being switched on, another class of complications stem from dealing with the possibility of hash collisions, the situation where two distinct strings have the same hash code. A string can be constructed to have any integer hash code so collisions are always a possibility. Since many strings have the same hash code, it is not sufficient to verify the string being switched on just has the same hash value as a string case label; the string being switched on must be checked for equality with the case label string. Furthermore, when two string case labels have the same hash value, a string being switched on with a matching hash code must be checked for equality potentially against both case labels.

While relying on hash codes to implement string switch is contentious with some, the hashing algorithm of java.lang.String is an extremely stable part of the platform and there would be too much behavioral compatibility risk in changing it. Therefore, the stability of the algorithm can be relied on as a resource in possible string switch implementations. Switching on the hash code in the desugaring confers a number of benefits. First, most immediately the hash code maps the string to an integer value, matching the type required for the existing switch statement. Second, switching on the hash code of a string bounds the worst case behavior. The simplest way to see if a chosen string is in a set of other strings, such as the set of string case labels, would be to compare the chosen string to each of the strings in the set. This could be expensive since the chosen string would need to be traversed many times, potentially once for each case label. The hash code of a string is typically cached after it is first computed. Therefore, when switching on the hash code, the chosen string is not expected to be traversed more than twice (once to compute the hash code if not cached, again to compare against strings from the set of strings with the same hash value — a set usually with only one element).

If instead of a series of two desugared switch statements, only a single switch statement were desired in the desugaring, extra synthetic state variables could be used to contend with hash collisions, fall-through control flows, and default cases, as described in the Project Coin strings in switch proposal. A goto construct could be used to eliminate state variables, but goto is neither available in the source language nor in javac's intermediate representation. However, by a novel use of nested labeled breaks, a single switch statement can be used in the desugaring without introducing additional synthetic control variables.

Consider the strings switch statement in the method f below

static void f(String s) { // Original sugared code
  switch (s) {
    case "azvl":
      System.out.println("azvl: "+s); // fallthrough
    case "quux":
      System.out.println("Quux: "+s); // fallthrough
    case "foo":
      int i = 5; //fallthrough
    case "bar":
      System.out.println("FooOrBar " + (i = 6) + ": "+s);
      break;
    case "bmjrabc": // same hash as "azvl"
      System.out.println("bmjrabc: "+s);
      break;
    case "baz":
       System.out.println("Baz " + (i = 7) + ": "+s); // fallthrough
    default:
      System.out.println("default: "+s);
  }
}

and the following desugaring procedure. Create a labeled block to enclose the entire switch statement. Within that enclosing block, create a series of nested blocks, one for each case label, including a default option, if any. In the innermost block, have a switch statement based on the hash code of the strings in the original case labels. For each hash value present in the set of case labels, have an if-then-else chain comparing the string being switched on to the cases having that hash value, breaking to the corresponding label if there is a match. If a match does not occur, if the original switch has a default option, a break should transfer control to the label for the default case; if the original case does not have a default option, a break should occur to the switch exit label.

If a hash value only corresponds to a single case label, the sense of the equality/inequality comparison in the desugared code can be tuned for branch prediction purposes. After the block for a case label is closed, the code for that alternative appears. In the original switch code, there are two normal completion paths of interest: the code for an alternative is run and execution falls through to the next alternative or there is an unlabeled break to exit the switch. In the desugaring, these paths are represented by execution falling through to code for the next alternative and by a labeled break to the label synthesized for the switch statement exit. The preservation of fall through semantics is possible because the code interspersed in the nested labeled statements appears in the same textual order as in the original "sugared" string switch. Local variables can be declared in the middle of a switch block. In desugared code, such variable declarations are hoisted out to reside in the block for the entire switch statement; the declaration of the variable and its uses are then renamed to a synthetic value to avoid changing the meaning of names in other scopes. Sample results of this procedure are shown below.

static void f(String s) { // Desugared code
  $exit: {
    int i$foo = 0;
    $default_label: {
      $baz: {
        $bmjrabc: {					    
          $bar: {
            $foo: {
              $quux: {
                $azvl: {
                  switch(s.hashCode()) { // cause NPE if s is null
	          case 3010735: // "azvl" and "bmjrabc".hashCode()
                      if (s.equals("azvl"))
		        break $azvl;          
                      else if (s.equals("bmjrabc"))
                        break $bmjrabc;
                      else
                        break $default_label;
                    case 3482567: // "quux".hashCode()
                      if (!s.equals("quux")) // inequality compare
                        break $default_label;
                      break $quux;
                    case 101574: // "foo".hashCode()
                      if (s.equals("foo")) // equality compare
                        break $foo;
                      break $default_label;
                    case 97299:  // "bar".hashCode()
                      if (!s.equals("bar"))
                        break $default_label;
                      break $bar;
                    case 97307: // "baz".hashCode()
                      if (!s.equals("baz"))
                        break $default_label;
                      break $baz;
                    default:
                      break $default_label;
                  }//switch
                }//azvl
                System.out.println("azvl: "+s); // fallthrough
              } //quux
              System.out.println("Quux: "+s); // fallthrough
            } //foo
            i$foo = 5;
          }//bar
          System.out.println("FooOrBar " + (i$foo = 6) + ": "+s);
          break $exit;
        }//bmjrabc
        System.out.println("bmjrabc: " + s);
        break $exit;
      } //baz
      System.out.println("Baz " + (i$foo = 7) + ": "+s); // fallthrough
    }//default_label
    System.out.println("default: "+s);
  }//exit
}

While the series of two switches and the labeled break-based desugaring were both viable alternatives, we choose the series of two switches since the transformation seemed more localized and straightforward. The two-switch solution also has simpler interactions with debuggers. If string switches become widely used, profiling information can be used to guide future engineering efforts to optimize their performance.

Comments:

Labels can be used for pure blocks?! You changed my world view.

Posted by halyavin on February 11, 2010 at 09:55 PM PST #

@halyavin,

The desugared code as listed is legal Java and if put inside a class will compile and run.

Posted by Joe Darcy on February 12, 2010 at 12:56 AM PST #

How about having the compiler detect hash code clashes in constants, and for those rare occasions, instead of switching on string.hashCode(), switch on string.length()+string.hashCode() or string.length()+string.hashCode() + (string.length>0 ? string.charAt(0): -1) or whatever combination is needed to uniquely identify the cases? It should be possible, given a known set of case strings, to generate a O(1) function to distinguish them.

You will still need the .equals method to catch accidentally matching strings coming in, but you would never need the "else if" method inside the case statements.

Posted by Niels Harremoes on February 12, 2010 at 06:44 AM PST #

@Niels,

Yes, the possibility of using hash functions other than String.hashCode is certainly possible and is discussed in the Project Coin proposal form for strings in switch as well as in a comment in the javac implementation. In the extreme, there are algorithms for computing a custom perfect hash function for a given set of inputs.

However, the engineering and testing costs for building all that are not warranted for the initial implementation; future uses of string switch will help guide what further optimizations are reasonable.

Posted by Joe Darcy on February 12, 2010 at 07:46 AM PST #

Looks nice, but if we get switch on string why not on long? (or did I miss that?) That would seem a much more obvious extension especially with the growing prevalence of 64bit architectures.

Posted by rod burman on March 23, 2010 at 02:43 AM PDT #

@Rod,

No; neither before nor after Project Coin is switching on long supported. The JVM instructions implementation switch only support int arguments and the most natural way to support long switches would be to change the JVM instruction, which would be too large of a change for Project Coin.

Posted by Joseph Darcy on March 23, 2010 at 03:32 AM PDT #

Post a Comment:
Comments are closed for this entry.
About

darcy

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today
News

No bookmarks in folder

Blogroll