Unhashing a String

My working on the strings in switch implementation has gotten swapped back in. The implementation is following the translation strategy outlined in the strings in switch proposal: find a perfect hash function for the input strings and semantically replace case "Foo": with case hash("Foo"): (along with some additional checks and logic) where hash("Foo") is computed as a constant by the compiler.

With this approach, to write the regression tests it is helpful to be able to construct a string with a given hash value to force collisions and test the alternate logic, which led me to write the "unhash" method below to create a string with a given hash code:

/\*\*
  \* Returns a string with a hash equal to the argument.
  \* @return string with a hash equal to the argument.
  \*/
 public static String unhash(int target) {
    StringBuilder answer = new StringBuilder();
    if (target < 0) {
        // String with hash of Integer.MIN_VALUE, 0x80000000
        answer.append("\\u0915\\u0009\\u001e\\u000c\\u0002");

        if (target == Integer.MIN_VALUE)
            return answer.toString();
        // Find target without sign bit set
        target = target & Integer.MAX_VALUE;
    }
        
    unhash0(answer, target);
    return answer.toString();
}

private static void unhash0(StringBuilder partial, int target) {
    int div = target / 31;
    int rem = target % 31;

    if (div <= Character.MAX_VALUE) {
        if (div != 0)
            partial.append((char)div);
        partial.append((char)rem);
    } else {
        unhash0(partial, div);
        partial.append((char)rem);
    }
}

The algorithm for hashing strings multiplies the old hash value by 31 and adds in the integer value of the next character:

h = 0;
for (int i = 0; i < len; i++) {
    h = 31\*h + val[off++];
}

The unhash method works in reverse; for the non-negative values handled by unhash0, divide the target value by 31:

  • if the quotient is less than Character.MAX_VALUE, the quotient and remainder can be fully captured using at most two characters.

  • if the quotient is greater than or equal to Character.MAX_VALUE, unhash the quotient and append to that string a character for the remainder.

With some additional care, negative values can reuse the same process. The key observation is that if a string hashes to Integer.MIN_VALUE (0x80000000), subsequent multiples by 31 (0x1f) do not change the result since
0x1f × 0x80000000 = 0xf80000000
which is again 0x80000000 when limited to int range. Therefore, the sign bit can be set and then the remaining bits handled as if the target were positive.

Using a few minutes of computer time, I tested the unhash method on all integer values and it always returned a correct string. When available, exhaustive testing is pleasantly simple and reassuring! The generated strings are relatively short; as shown in the table below, for non-negative values the average length is slightly over four characters.

Distribution of String Lengths of Unhashed Non-negative Values
Length Frequency
1 31
2 2,031,585
3 60,948,480
4 1,889,402,880
5 195,100,672
Total 2,147,483,648

The unhash of 0 could be special-cased to return the empty string of length zero rather than a length-one string of the \\u0000 character, but this was not necessary for the purposes at hand. Likewise, generating somewhat shorter strings for negative hash values may be possible, but further investigation is not needed just to be able to generate collisions. As is stands, unhash will return a string whose length is at most ten; five characters for a negative sign bit and at most another five characters for the non-sign bits.

Comments:

Nicely devious... B\^>

Now, would you contemplate thinking about possibly evaluating in the fullness of time ... generalising this to work over CharSequence instead?

Basically I've been resorting to CharSequence derivatives to make for more memory-efficient substitutes for String.

You might have to explicitly compute a hash (and thus losing the caching within String) for a non-String switch value to ensure that you obtained the hash that you expect, but that might be the extent of the additional effort required.

Rgds

Damon

Posted by Damon Hart-Davis on September 10, 2009 at 05:26 AM PDT #

@Damon,

Strings in switch would not generalize well to arbitrary CharSequences since CharSequence objects are not necessarily immutable and their value could thus change between the time the hash was taken and when the value was used inside the corresponding case(s). Also the exact behavior equals and hashCode methods on arbitrary CharSequences are not defined.

Switching on myCharSequence.toString() would work, but a String would need to be created of course.

Posted by Joe Darcy on September 10, 2009 at 05:53 AM PDT #

Hi,

I hear what you are saying about immutability, but assuming the labels are (or behave as if) constant String values then all you would need to do for a non-String CharSequence switch value is compute its hash with the algorithm that String uses on charAt() for the value, not calling the value's hashCode() as you imply, avoiding the String creation. Likewise you would then use a charAt()-wise equals() replacement. If you were worried about keeping the JVM and String hash impl/algo in step then that would be a bigger issue.

And if users of the switch are silly enough to mutate their value (eg in another thread) while switching on it, well... Maybe there needs to be some kind of mutability Darwin award for bad coding? B\^>

Rgds

Damon

Posted by Damon Hart-Davis on September 10, 2009 at 07:13 AM PDT #

A non recursive version for you.

public static String unhash(int hash) {
long h = hash & ((1L << 32) - 1);
StringBuilder sb = new StringBuilder();
while (h > 0) {
sb.append((char) (h % 31));
h /= 31;
}
return sb.reverse().toString();
}

Posted by Peter Lawery on September 12, 2009 at 06:49 PM PDT #

It is possible to select from a variety of hash functions as determined by the actual cases.
If all cases are very short the hashing of the switch variable may incur a significant overhead.
In some cases, such as when each case has a unique length, the string length can be the hash function. This algorithm selection will allow the best performance and lowest level of complexity in the resulting bytecode

Posted by Sogon on November 03, 2009 at 03:45 AM PST #

@Sogon,

Yes, the possibility of generating per-switch statement perfect hash functions is discussed in the original strings in switch proposal:

"Project Coin: Proposal for Strings in switch"
http://blogs.sun.com/darcy/entry/project_coin_strings_in_switch

For the right inputs, the length of the string is one candidate function.

The future implementation of the strings in switch can be enhanced and optimized based on usage.

Posted by Joe Darcy on November 03, 2009 at 04:48 AM PST #

Joe: really glad to see that String switches are gonna make it in the next release. Clever efficient implementation too!

Posted by jim smith on November 21, 2009 at 03:25 AM PST #

Post a Comment:
Comments are closed for this entry.
About

darcy

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today
News

No bookmarks in folder

Blogroll