X

Musings on JDK development

  • Java |
    September 10, 2009

Unhashing a String

My working on the strings in switch implementation has gotten swapped back in.
The implementation is following the translation strategy outlined in the
strings in switch proposal: find a perfect hash function for the input strings and semantically replace case "Foo": with case hash("Foo"): (along with some additional checks and logic) where hash("Foo") is computed as a constant by the compiler.

With this approach, to write the regression tests it is helpful to be able to construct a string with a given hash value to force collisions and test the alternate logic, which led me to write the "unhash" method below to create a string with a given hash code:

/\*\*
\* Returns a string with a hash equal to the argument.
\* @return string with a hash equal to the argument.
\*/
public static String unhash(int target) {
StringBuilder answer = new StringBuilder();
if (target < 0) {
// String with hash of Integer.MIN_VALUE, 0x80000000
answer.append("\\u0915\\u0009\\u001e\\u000c\\u0002");
if (target == Integer.MIN_VALUE)
return answer.toString();
// Find target without sign bit set
target = target & Integer.MAX_VALUE;
}
unhash0(answer, target);
return answer.toString();
}
private static void unhash0(StringBuilder partial, int target) {
int div = target / 31;
int rem = target % 31;
if (div <= Character.MAX_VALUE) {
if (div != 0)
partial.append((char)div);
partial.append((char)rem);
} else {
unhash0(partial, div);
partial.append((char)rem);
}
}

The algorithm for hashing strings multiplies the old hash value by 31 and adds in the integer value of the next character:

h = 0;
for (int i = 0; i < len; i++) {
h = 31\*h + val[off++];
}

The unhash method works in reverse; for the non-negative values handled by unhash0, divide the target value by 31:

  • if the quotient is less than Character.MAX_VALUE, the quotient and remainder can be fully captured using at most two characters.

  • if the quotient is greater than or equal to Character.MAX_VALUE, unhash the quotient and append to that string a character for the remainder.

With some additional care, negative values can reuse the same process. The key observation is that if a string hashes to Integer.MIN_VALUE (0x80000000), subsequent multiples by 31 (0x1f) do not change the result since

0x1f × 0x80000000 = 0xf80000000

which is again 0x80000000 when limited to int range. Therefore, the sign bit can be set and then the remaining bits handled as if the target were positive.

Using a few minutes of computer time, I tested the unhash method on all integer values and it always returned a correct string. When available, exhaustive testing is pleasantly simple and reassuring! The generated strings are relatively short; as shown in the table below, for non-negative values the average length is slightly over four characters.











Distribution of String Lengths of Unhashed Non-negative Values
Length Frequency
1 31
2 2,031,585
3 60,948,480
4 1,889,402,880
5 195,100,672
Total 2,147,483,648

The unhash of 0 could be special-cased to return the empty string of length zero rather than a length-one string of the \\u0000 character, but this was not necessary for the purposes at hand. Likewise, generating somewhat shorter strings for negative hash values may be possible, but further investigation is not needed just to be able to generate collisions. As is stands, unhash will return a string whose length is at most ten; five characters for a negative sign bit and at most another five characters for the non-sign bits.

Join the discussion

Comments ( 7 )
  • Damon Hart-Davis Thursday, September 10, 2009

    Nicely devious... B\^>

    Now, would you contemplate thinking about possibly evaluating in the fullness of time ... generalising this to work over CharSequence instead?

    Basically I've been resorting to CharSequence derivatives to make for more memory-efficient substitutes for String.

    You might have to explicitly compute a hash (and thus losing the caching within String) for a non-String switch value to ensure that you obtained the hash that you expect, but that might be the extent of the additional effort required.

    Rgds

    Damon


  • Joe Darcy Thursday, September 10, 2009

    @Damon,

    Strings in switch would not generalize well to arbitrary CharSequences since CharSequence objects are not necessarily immutable and their value could thus change between the time the hash was taken and when the value was used inside the corresponding case(s). Also the exact behavior equals and hashCode methods on arbitrary CharSequences are not defined.

    Switching on myCharSequence.toString() would work, but a String would need to be created of course.


  • Damon Hart-Davis Thursday, September 10, 2009

    Hi,

    I hear what you are saying about immutability, but assuming the labels are (or behave as if) constant String values then all you would need to do for a non-String CharSequence switch value is compute its hash with the algorithm that String uses on charAt() for the value, not calling the value's hashCode() as you imply, avoiding the String creation. Likewise you would then use a charAt()-wise equals() replacement. If you were worried about keeping the JVM and String hash impl/algo in step then that would be a bigger issue.

    And if users of the switch are silly enough to mutate their value (eg in another thread) while switching on it, well... Maybe there needs to be some kind of mutability Darwin award for bad coding? B\^>

    Rgds

    Damon


  • Peter Lawery Sunday, September 13, 2009

    A non recursive version for you.

    public static String unhash(int hash) {

    long h = hash & ((1L << 32) - 1);

    StringBuilder sb = new StringBuilder();

    while (h > 0) {

    sb.append((char) (h % 31));

    h /= 31;

    }

    return sb.reverse().toString();

    }


  • Sogon Tuesday, November 3, 2009

    It is possible to select from a variety of hash functions as determined by the actual cases.

    If all cases are very short the hashing of the switch variable may incur a significant overhead.

    In some cases, such as when each case has a unique length, the string length can be the hash function. This algorithm selection will allow the best performance and lowest level of complexity in the resulting bytecode


  • Joe Darcy Tuesday, November 3, 2009

    @Sogon,

    Yes, the possibility of generating per-switch statement perfect hash functions is discussed in the original strings in switch proposal:

    "Project Coin: Proposal for Strings in switch"

    http://blogs.sun.com/darcy/entry/project_coin_strings_in_switch

    For the right inputs, the length of the string is one candidate function.

    The future implementation of the strings in switch can be enhanced and optimized based on usage.


  • jim smith Saturday, November 21, 2009

    Joe: really glad to see that String switches are gonna make it in the next release. Clever efficient implementation too!


Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.Captcha
Oracle

Integrated Cloud Applications & Platform Services