### Unhashing a String

#### By Darcy-Oracle on Sep 10, 2009

My working on the strings in switch implementation has gotten swapped back in.
The implementation is following the translation strategy outlined in the
strings in switch proposal: find a perfect hash function for the input strings and semantically replace `case "Foo":`

with `case hash("Foo"):`

(along with some additional checks and logic) where `hash("Foo")`

is computed as a constant by the compiler.

With this approach, to write the regression tests it is helpful to be able to construct a string with a given hash value to force collisions and test the alternate logic, which led me to write the "unhash" method below to create a string with a given hash code:

/\*\* \* Returns a string with a hash equal to the argument. \* @return string with a hash equal to the argument. \*/ public static String unhash(int target) { StringBuilder answer = new StringBuilder(); if (target < 0) { // String with hash of Integer.MIN_VALUE, 0x80000000 answer.append("\\u0915\\u0009\\u001e\\u000c\\u0002"); if (target == Integer.MIN_VALUE) return answer.toString(); // Find target without sign bit set target = target & Integer.MAX_VALUE; } unhash0(answer, target); return answer.toString(); } private static void unhash0(StringBuilder partial, int target) { int div = target / 31; int rem = target % 31; if (div <= Character.MAX_VALUE) { if (div != 0) partial.append((char)div); partial.append((char)rem); } else { unhash0(partial, div); partial.append((char)rem); } }

The algorithm for hashing strings multiplies the old hash value by 31 and adds in the integer value of the next character:

h = 0; for (int i = 0; i < len; i++) { h = 31\*h + val[off++]; }

The unhash method works in reverse; for the non-negative values handled by `unhash0`

, divide the target value by 31:

if the quotient is less than

`Character.MAX_VALUE`

, the quotient and remainder can be fully captured using at most two characters.if the quotient is greater than or equal to

`Character.MAX_VALUE`

, unhash the*quotient*and append to that string a character for the*remainder*.

With some additional care, negative values can reuse the same process. The key observation is that if a string hashes to `Integer.MIN_VALUE`

(0x80000000), subsequent multiples by 31 (0x1f) do not change the result since

0x1f × 0x80000000 = 0xf80000000

which is again 0x80000000 when limited to `int`

range. Therefore, the sign bit can be set and then the remaining bits handled as if the target were positive.

Using a few minutes of computer time, I tested the unhash method on *all* integer values and it always returned a correct string. When available, exhaustive testing is pleasantly simple and reassuring! The generated strings are relatively short; as shown in the table below, for non-negative values the average length is slightly over four characters.

Length | Frequency |
---|---|

1 | 31 |

2 | 2,031,585 |

3 | 60,948,480 |

4 | 1,889,402,880 |

5 | 195,100,672 |

Total | 2,147,483,648 |

The unhash of 0 could be special-cased to return the empty string of length zero rather than a length-one string of the `\\u0000`

character, but this was not necessary for the purposes at hand. Likewise, generating somewhat shorter strings for negative hash values may be possible, but further investigation is not needed just to be able to generate collisions. As is stands, unhash will return a string whose length is at most ten; five characters for a negative sign bit and at most another five characters for the non-sign bits.