### Recognizing all valid integral strings with regular expressions

For a fun Friday hack, this blog entry describes a program to generate a regular expression to recognize all the valid strings of integers in a given base, that is, a regular expression (regex) which accepts all in-range strings but rejects strings that would cause an overflow if converted.

This sort of regular expression could be used to verify a string won't cause a `NumberFormatException` when processed by one of the many text → number methods in the Java platform, such as `Integer.parseInt(String s, int base)`. However, using a regular expression may be more expensive than attempting the conversion and catching the exception on invalid input; the regular expression discussed is possibly of more academic interest than practical significance!

An analogous regular expression for floating-point strings has been published in the javadoc for several releases.

First, is the set of strings in a given base that will convert to a 32-bit or 64-bit integer a regular language, a language that can be recognized by a regular expression? Yes, because ignoring sequences of leading zeros for a moment, there are only a finite number of strings that can be converted to integer values, one string for each of the 232 or 264 values in the `int` or `long` types, respectively. All finite languages are regular, strings of zero of more leading zeros are regular, and concatenations of regular languages are regular. Putting those facts together, the entire set of convertible strings forms a regular language. This reasoning is not dependent on the base so it holds for all integer bases, including bases 2 through 36 supported by Java's libraries.

A core regular expression of billions of strings offered as alternatives ("1|2|3|...|2147483647") while fine from a mathematical perspective, would be too slow and awkward to use. Fortunately, the pattern of valid strings has sufficient structure to yield reasonably short and manageable regular expressions.

To start we will only consider decimal strings; additionally, we will assume only ASCII digits need to be recognized. Integer values range from

-2147483648

to

2147483647

These strings both have 10 digits. Putting aside leading zeros for the moment, all strings of 9 or fewer digits are valid, with or without a leading minus sign. A regex to recognize strings with 9 or fewer digits is trivial; less obvious is how to specify the "ragged" 10-digit nonzero strings which are in range.

The core regular expression covering non-ragged inputs is

```"(-)?"+			// optional leading minus sign
"0\*"+			// leading zeros
"(\\p{Digit}{1,9})"	// numbers with 1 to 9 digits
```

For the 10-digit strings, if the leading digit is less than the maximum digit, all other digits can have any value:

```1(\\p{Digit}{9})
```

If the first digit is at it's maximum, then if the second digit is less than its maximum, the third and subsequent digits can have any value:

```20\\p{Digit}{8}
```

Likewise, if the first and second digits are at their maximum, then if the third digit is less than its maximum, the fourth and subsequent digits can have any value. Continuing this pattern for all the digit positions:

```1(\\p{Digit}{9})|
20(\\p{Digit}{8})|
21[0-3](\\p{Digit}{7})|
214[0-6](\\p{Digit}{6})|
2147[0-3](\\p{Digit}{5})|
21474[0-7](\\p{Digit}{4})|
214748[0-2](\\p{Digit}{3})|
2147483[0-5](\\p{Digit}{2})|
21474836[0-3](\\p{Digit}{1})|
214748364[0-7]
```

This regular expression can be ORed with the regex for strings of 1 to 9 digits listed above. Finally, a separate case for the most negative value must be added:

```-0\*2147483648
```

All together, with some newlines for readability this yields:

```((-)?0\*((\\p{Digit}){1,9})
|([0-1]\\p{Digit}{9})
|(20\\p{Digit}{8})
|(21[0-3]\\p{Digit}{7})
|(214[0-6]\\p{Digit}{6})
|(2147[0-3]\\p{Digit}{5})
|(21474[0-7]\\p{Digit}{4})
|(214748[0-2]\\p{Digit}{3})
|(2147483[0-5]\\p{Digit}{2})
|(21474836[0-3]\\p{Digit}{1})
|(214748364[0-7])
)|
(-0\*2147483648)
```

The ragged pattern for nonzero strings with maximum possible length will exist for bases that aren't powers of 2, for the Java libraries that includes all conversion radixes other than 2, 4, 8, 16, and 32. For powers of two, the pattern is more regular. For example, in base 16 the values range from

```-80000000
```

to

```7fffffff
```

so for base 16 the regular expression can simply be

```((-)?0\*(\\p{XDigit}{1,7}|
[0-7](\\p{XDigit}{7}))|
(-0\*80000000)
```

Generalizing the regular expression generation algorithm to any given base:

1. Create a regex for an optional leading minus sign ("`-`") and zero or more leading zeros, "`0\*`."

2. For the base in question, generate the strings for `MIN_VALUE` and `MAX_VALUE`.

3. Find n, the length of the `MAX_VALUE` string. Signed strings of that base with (n-1) or fewer characters are all valid inputs; create a regex recognizing (n-1) or fewer digits.

4. Create a regex to recognize valid strings of length n.

5. Combine the above regular expressions.

6. To the above, concatenate a separate regex for the most negative value,

`-(0\*)MIN_VALUE_STRING`

Taken together, this regex will recognize exactly those strings convertible to the base in question.

Untested code implementing this algorithm is available for your viewing pleasure. Any further debugging, tuning, and enhancements are left as "an exercise for the reader," happy hacking!

Comments:

.. and how many hundreds of times slower is that regex compared to a single strtol() call? :)

Posted by Dan Nelson on February 05, 2010 at 07:59 AM PST #

Post a Comment:
Comments are closed for this entry.

Darcy-Oracle

##### Archives
Sun Mon Tue Wed Thu Fri Sat « November 2015 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 Today
##### News

No bookmarks in folder

RSS Atom