Thursday May 26, 2011

A Facelift for the Java Regular Expression Unicode support

As documented in the "Unicode Support" section of the latest JDK6 Pattern API, Java regular expressions support Unicode hex notation, Unicode properties such as general category, block, uppercase and lowercase, simple word boundaries, case mapping, supplementary characters, and so on. The implementation has claimed to be "in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression, plus RL2.1 Canonical Equivalents" from the beginning (JDK 1.4), which we believed should satisfy most normal developer requirements for Unicode support. However, early this year Tom Christiansen took some time from his busy Perl 5.14 schedule to perform a comprehensive reality check of the Java Unicode support, focusing on regular expressions. Most of Tom's reports and the follow-up discussions between Tom and me can be found in the OpenJDK i18n-dev mailing list archives. To summarize:

  • Though the existing Java Unicode escape sequence for UTF-16, the \uXXXX notation, is arguably a mechanism to specify a Unicode code point in a Java regular expression, the "newly" added Perl-style \x{...} is definitely a more convenient construct and better meets the RL1.1 Hex Notation requirement.

  • The predefined POSIX character classes, such as \p{Upper}, \p{Lower}, \p{Space}, \p{Alpha}, etc., are all ASCII-only versions in j.u.regex, which obviously does not meet the Unicode guideline RL1.2a Compatibility Properties. (By the way, Perl migrated from its early ASCII-only implementation to the Unicode version years ago.)

  • Some of the most frequently used properties (also required by the guideline RL1.2 Properties), such as Script, Alphabetic, Uppercase, Lowercase and White_Space (the last three via \p{javaUpperCase}, \p{javaLowerCase}, \p{javaWhitespace}), are either missing or have a "slightly" different interpretation compared to the standard. One of the reasons is that Java did not have access (in the existing implementation) to the properties defined in PropList.txt. So, for example, Alphabetic is completely missing, and Character.isLowerCase only treats GC=LOWERCASE_LETTER code points as "lowercase" and does not count Other_Lowercase.

  • The behaviors of the predefined character classes \b, \B, \w and \W in Java regex are "broken". To use Tom's words, they break the historical connection between the word characters \w and the word boundaries \b, for "no sound reason". This is because \w is defined as an ASCII-only version, "A word character: [a-zA-Z_0-9]", while the implementation of \b is a "semi-" Unicode-aware version. The result, as shown in the sample Tom gave, is that "élève" is NOT matched by the pattern \b\w+\b. So the RL1.4 Simple Word Boundaries requirement is obviously NOT met.

  • There are too many "errors" in the Perl-related wording in the j.u.regex.Pattern doc; it is way overdue for an update.

It's an embarrassing reality. So although JDK7 was at a very late stage of the release cycle, we were convinced that it was worth taking some risk to give the regex Unicode support a facelift. Here is a summary of what we have done so far. We believe that with these changes the Unicode support in RegEx is in much better shape.

The support of the new Perl style \x{...} construct

Before JDK7, the Java Unicode escape sequence notation \uXXXX was the only way to specify a Unicode code point in Java regex. It is neither convenient nor straightforward when you have to deal with a supplementary character, for which two consecutive Unicode escape sequences (the surrogate pair) have to be used, for example,

    int codePoint = ...;
    String hexPattern = codePoint <= 0xFFFF
                        ? String.format("\\u%04x", codePoint)
                        : String.format("\\u%04x\\u%04x",
                                        (int) Character.highSurrogate(codePoint),
                                        (int) Character.lowSurrogate(codePoint));

compared to simply writing

    String hexPattern = "\\x{" + Integer.toHexString(codePoint) + "}";
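As a rough illustration of the \x{...} construct, the sketch below builds a pattern for a single code point and checks that it matches the corresponding character, including a supplementary one. The class and method names are made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class HexNotationDemo {
    // Builds a pattern for one code point using the Perl-style hex construct (JDK7+).
    static Pattern forCodePoint(int codePoint) {
        return Pattern.compile("\\x{" + Integer.toHexString(codePoint) + "}");
    }

    // True if the pattern built for the code point matches that character.
    static boolean matchesItself(int codePoint) {
        String input = new String(Character.toChars(codePoint));
        return forCodePoint(codePoint).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesItself(0x41));    // a BMP character, 'A'
        System.out.println(matchesItself(0x10400)); // a supplementary character
    }
}
```

The same helper works for BMP and supplementary code points alike, which is the convenience the new construct is meant to provide.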

A new flag UNICODE_CHARACTER_CLASS is added to support Unicode version of POSIX character classes

With this flag set, the ASCII-only predefined character classes (\b \B \w \W \d \D \s \S) and POSIX character classes (\p{alpha}, \p{lower}, \p{upper}...) are flipped to their Unicode versions*, as documented in the latest JDK7 j.u.regex.Pattern API doc. (This flag also enables UNICODE_CASE: if the new flag UNICODE_CHARACTER_CLASS is specified, everything is Unicode.) Ideally I would like to just upgrade Java regex from the aged ASCII-only behavior to Unicode (maybe with an ASCII_ONLY_COMPATIBILITY mode as a fallback), as Perl did years ago, but given Java's "compatibility" spirit (and the performance concern as well), this is unlikely to be accepted.

* This addresses the various \b\w+\b issues.
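A minimal sketch of the difference, using the word from Tom's example written with Unicode escapes so the source stays encoding-neutral. The class and method names are invented for this illustration; (?U) is the embedded form of the flag.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class UnicodeClassDemo {
    // The French word from the example above, spelled with escapes.
    static final String WORD = "\u00e9l\u00e8ve";
    static final String REGEX = "\\b\\w+\\b";

    // Default behavior: \w is ASCII-only, so the accented letters are not word characters.
    static boolean asciiMatch() {
        return Pattern.compile(REGEX).matcher(WORD).matches();
    }

    // With the new flag, \w and \b both use the Unicode definitions.
    static boolean unicodeMatch() {
        return Pattern.compile(REGEX, Pattern.UNICODE_CHARACTER_CLASS).matcher(WORD).matches();
    }

    // The same flag embedded in the regex body.
    static boolean embeddedFlagMatch() {
        return WORD.matches("(?U)" + REGEX);
    }

    public static void main(String[] args) {
        System.out.println(asciiMatch());        // false
        System.out.println(unicodeMatch());      // true
        System.out.println(embeddedFlagMatch()); // true
    }
}
```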

\p{IsBinaryProperty} is introduced to explicitly support Unicode binary properties

  • Alphabetic

  • Ideographic

  • Letter

  • Lowercase

  • Uppercase

  • Titlecase

  • Punctuation

  • Control

  • White_Space

  • Digit

  • Hex_Digit

  • Noncharacter_Code_Point

  • Assigned
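To give a feel for the syntax, here is a small sketch that checks a few of the properties listed above against sample inputs. The helper name and the sample strings are chosen only for this illustration.

```java
// Hypothetical demo class, for illustration only.
class BinaryPropertyDemo {
    // True if every character of the input has the named binary property.
    static boolean allHave(String property, String input) {
        return input.matches("\\p{Is" + property + "}+");
    }

    public static void main(String[] args) {
        // Accented Latin letters are Alphabetic; ASCII digits are not.
        System.out.println(allHave("Alphabetic", "\u00e9l\u00e8ve")); // true
        System.out.println(allHave("Alphabetic", "123"));             // false
        // Two CJK ideographs.
        System.out.println(allHave("Ideographic", "\u4e2d\u6587"));   // true
        // NO-BREAK SPACE has White_Space but is not matched by the default \s.
        System.out.println(allHave("White_Space", "\u00a0"));         // true
        System.out.println("\u00a0".matches("\\s"));                  // false
    }
}
```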

The spec and implementation of both j.l.C.isLowerCase() and j.l.C.isUpperCase() have also been updated to match the Unicode Standard definition of "lowercase" (LOWERCASE_LETTER + Other_Lowercase) and "uppercase" (UPPERCASE_LETTER + Other_Uppercase). And we have two new friends, j.l.C.isIdeographic() and j.l.C.isAlphabetic(). All of the above Unicode properties now fully follow the Unicode Standard.
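The effect of counting Other_Lowercase can be seen with a character that is lowercase without being a lowercase letter. The sketch below uses U+2170 (SMALL ROMAN NUMERAL ONE), whose general category is Nl (letter number); the class and method names are made up for this example.

```java
// Hypothetical demo class, for illustration only.
class CharacterPropertyDemo {
    // U+2170: general category Nl, lowercase via Other_Lowercase.
    static final int SMALL_ROMAN_ONE = 0x2170;

    static boolean isLetterNumber(int cp) {
        return Character.getType(cp) == Character.LETTER_NUMBER;
    }

    // Matches the single code point against a one-character regex.
    static boolean matches(int cp, String regex) {
        return new String(Character.toChars(cp)).matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(isLetterNumber(SMALL_ROMAN_ONE));                // true
        System.out.println(Character.isLowerCase(SMALL_ROMAN_ONE));         // true on JDK7+
        System.out.println(Character.isAlphabetic(SMALL_ROMAN_ONE));        // true
        System.out.println(Character.isIdeographic(0x4e2d));                // true
        System.out.println(matches(SMALL_ROMAN_ONE, "\\p{IsLowercase}"));   // true
        System.out.println(matches(SMALL_ROMAN_ONE, "\\p{Ll}"));            // false
    }
}
```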

Unicode Scripts are now supported

Scripts are specified either with the prefix Is, as in \p{IsHiragana}, or by using the script keyword (or its short form sc), as in \p{script=Hiragana} or \p{sc=Hiragana}. The script names supported by Pattern are the valid script names accepted and defined by UnicodeScript.forName. (Yes, j.l.C.UnicodeScript is also new; I might write a separate blog entry about it.)
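A short sketch of the three equivalent spellings, checked against a four-character Hiragana string written with Unicode escapes. The class and method names are invented for this illustration.

```java
// Hypothetical demo class, for illustration only.
class ScriptDemo {
    // Four Hiragana characters, spelled with escapes to stay encoding-neutral.
    static final String HIRAGANA = "\u3072\u3089\u304c\u306a";

    static boolean allMatch(String scriptClass) {
        return HIRAGANA.matches(scriptClass + "+");
    }

    static Character.UnicodeScript scriptOfFirstChar() {
        return Character.UnicodeScript.of(HIRAGANA.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(allMatch("\\p{IsHiragana}"));     // true
        System.out.println(allMatch("\\p{script=Hiragana}")); // true
        System.out.println(allMatch("\\p{sc=Hiragana}"));     // true
        System.out.println(allMatch("\\p{IsLatin}"));         // false
        System.out.println(scriptOfFirstChar());              // HIRAGANA
    }
}
```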

And yes, we have corrected those errors in the Perl-related documentation, with Tom's help.

Sure, most of the enhancements mentioned here are probably "who cares" for most developers:-) but if you have read this far, I guess you do care, so here is the latest JDK7 j.u.regex API doc. And...yes, \N{...} and \X are on my to-do list:-)

And a BIG THANKS to Tom Christiansen!!!

Tuesday Mar 24, 2009

Case-Insensitive Matching in Java RegEx

-Question1: Does Java RegEx support case-insensitive matching?
-Yes, it does.

-Question2: Can Java RegEx do "case-insensitive matching" on Non-ASCII text?
-Sure, Java RegEx supports not only case-insensitive matching of characters in US-ASCII; Unicode case folding has been supported as well from day one.

-Question3: Really? How to do that?
-It's actually fairly easy. Both can be enabled by specifying the corresponding match flag(s), CASE_INSENSITIVE and UNICODE_CASE. For example

for ASCII only case-insensitive matching

    Pattern p = Pattern.compile("YOUR_REGEX", Pattern.CASE_INSENSITIVE);

or for Unicode case-folding matching

    Pattern p = Pattern.compile("YOUR_REGEX", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
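To make the difference between the two flag combinations concrete, here is a small sketch comparing an ASCII input with a Cyrillic one. The class and helper names are made up for this example, and the results shown assume a JDK with the fix described further below.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class CaseFlagDemo {
    static final int ASCII_ONLY = Pattern.CASE_INSENSITIVE;
    static final int UNICODE = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // ASCII letters fold with CASE_INSENSITIVE alone.
        System.out.println(matches("abc", ASCII_ONLY, "ABC"));         // true
        // CYRILLIC SMALL LETTER A vs. CAPITAL LETTER A needs UNICODE_CASE as well.
        System.out.println(matches("\u0430", ASCII_ONLY, "\u0410"));   // false
        System.out.println(matches("\u0430", UNICODE, "\u0410"));      // true
    }
}
```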

For those String.matches(String regex) fans, you can use the special constructs (?i) and (?iu) to "embed" the matching flag(s) in the regex body: use (?i) for ASCII case-insensitive matching and (?iu) for Unicode case folding.

And here is the bonus: you can also "turn off" the flag(s) by using (?-i).
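A brief sketch of the embedded forms with String.matches, including switching the flag off partway through a pattern. The class name and sample strings are chosen only for this illustration.

```java
// Hypothetical demo class, for illustration only.
class EmbeddedFlagDemo {
    // Case-insensitive for the first letter only; (?-i) restores case-sensitive matching.
    static final String MIXED = "(?i)a(?-i)b";

    public static void main(String[] args) {
        System.out.println("ABC".matches("(?i)abc"));        // true
        // Cyrillic capital A against small a: needs the u flag as well.
        System.out.println("\u0410".matches("(?i)\u0430"));  // false
        System.out.println("\u0410".matches("(?iu)\u0430")); // true
        System.out.println("Ab".matches(MIXED));             // true
        System.out.println("AB".matches(MIXED));             // false
    }
}
```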

-Question4: Cool, here is the tough one, does it really work that way? I heard...
-.......OK, if you are using JDK6u2 (or a later update) or JDK7, the answer is "YES". But if you still use 6.0 or an earlier version, yes, we screwed it up "a little" in those releases:-( We fixed it in JDK7 and have already backported the change to 6.0u2. So get the latest update now.

For those who want to know a little more detail, here is the story.

The case folding spec in Java Regex clearly says

CASE_INSENSITIVE: By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag IN CONJUNCTION with this flag.

UNICODE_CASE: When this flag is specified then case-insensitive matching, when enabled by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched.

But our RegEx implementation in previous releases disagrees with our own spec:-(

(1) The flag UNICODE_CASE is mostly treated as "UNICODE_CASE_INSENSITIVE", meaning the matching is always case-insensitive whether or not the CASE_INSENSITIVE flag is enabled. The implementation only "accidentally" follows the spec (matching case-sensitively) in the character class case, if the specified character is Basic Latin (ASCII) or Latin-1 Supplement (<= 0xff).

(2) When CASE_INSENSITIVE is specified without UNICODE_CASE, some Unicode case-insensitive matching is still done.

  a) Character Class Single with Latin-1 Supplement input
    regex "[\u00e0]" matches input "\u00c0"

  b) Character Class Range with any Non-ASCII input
    regex "[\u00e0-\u00e5]" matches "\u00c2"
    regex "[\u0430-\u0431]" matches "\u0411"

  c) Back Reference with any Non-ASCII input
    regex "(\u00e0)\\1" matches "\u00e0\u00c0"
    regex "(\u0430)\\1" matches "\u0430\u0410"

(3) The only place we "get it right" is the regex constructs for a single character (for example, "\u00e0" does not match "\u00c0") or a slice of characters (for example, "\u00e0\u00e1" does not match "\u00c0\u00c1"). These do follow the spec and only allow ASCII characters to match case-insensitively when only CASE_INSENSITIVE is present.

The above implementation inconsistencies have been fixed in JDK7 and JDK6u2 to fully follow the specification.
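On a release that contains the fix, the cases listed above should behave as the spec describes. The sketch below re-runs one example from each of (1), (2a), (2b) and (2c); the class and helper names are invented for this illustration, and the expected results assume the fixed behavior.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class CaseSpecDemo {
    static final int CI = Pattern.CASE_INSENSITIVE;
    static final int CI_UNICODE = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // (1) UNICODE_CASE on its own must not make matching case-insensitive.
        System.out.println(matches("\u0430", Pattern.UNICODE_CASE, "\u0410"));   // false
        // (2a) single character in a class, Latin-1 Supplement input
        System.out.println(matches("[\u00e0]", CI, "\u00c0"));                   // false
        System.out.println(matches("[\u00e0]", CI_UNICODE, "\u00c0"));           // true
        // (2b) character class range, Cyrillic input
        System.out.println(matches("[\u0430-\u0431]", CI, "\u0411"));            // false
        System.out.println(matches("[\u0430-\u0431]", CI_UNICODE, "\u0411"));    // true
        // (2c) back reference, Cyrillic input
        System.out.println(matches("(\u0430)\\1", CI, "\u0430\u0410"));          // false
        System.out.println(matches("(\u0430)\\1", CI_UNICODE, "\u0430\u0410"));  // true
    }
}
```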

* \u0000-\u007f: Basic Latin (aka ASCII)
* \u0080-\u00ff: Latin-1 Supplement
* \u0400-\u04f9: Cyrillic

Sunday Mar 08, 2009

Named Capturing Group in JDK7 RegEx

In a complicated regular expression, which might have many capturing and non-capturing groups, the left-to-right group number counting can get a little confusing, and the expression itself (the groups and their back references) becomes hard to understand and trace. A natural solution is, instead of counting groups manually one by one from left to right, to give each group a name, and then back reference it (in the same regex) or access the captured match result (from MatchResult) by the assigned NAME, as Python, PHP, .Net and Perl (5.10.0) do in their regex engines. This convenient feature has been missing from Java RegEx for years; it finally made its way into JDK7 b50.

The newly added RegEx constructs to support the named capturing group are:

(1) (?<NAME>X) to define a named group "NAME"
(2) \k<NAME> to back reference a named group "NAME"
(3) ${NAME} to reference a captured group in the matcher's replacement string
(4) group(String NAME) to return the captured input subsequence for the given "named group"

With these new constructs, you can now write something like

    // Two named groups: "bytes" (1 to 4 hex digits) and "char" (exactly 4 hex digits)
    String pStr = "0x(?<bytes>\\p{XDigit}{1,4})\\s++u\\+(?<char>\\p{XDigit}{4})(?:\\s++)?";
    Matcher m = Pattern.compile(pStr).matcher(INPUTTEXT);
    if (m.matches()) {
        // Read the captured text by group name and parse it as hex
        int bs = Integer.valueOf(m.group("bytes"), 16);
        int c = Integer.valueOf(m.group("char"), 16);
        System.out.printf("[%x] -> [%04x]%n", bs, c);
    }

    // Named groups can also be referenced in the replacement string
    System.out.println("0x1234 u+5678".replaceFirst(pStr, "u+${char} 0x${bytes}"));
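The example above does not use the \k<NAME> back reference, so here is a separate small sketch that does: it finds a word that is immediately repeated and collapses the repetition. The class, method and group names are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class NamedBackrefDemo {
    // A word, whitespace, then the same word again via a named back reference.
    static final Pattern DOUBLED = Pattern.compile("\\b(?<word>\\w+)\\s+\\k<word>\\b");

    // Returns the first immediately repeated word, or null if there is none.
    static String firstDoubled(String text) {
        Matcher m = DOUBLED.matcher(text);
        return m.find() ? m.group("word") : null;
    }

    // Replaces each doubled word with a single copy, using the group name in the replacement.
    static String collapse(String text) {
        return DOUBLED.matcher(text).replaceAll("${word}");
    }

    public static void main(String[] args) {
        System.out.println(firstDoubled("it was was fine")); // was
        System.out.println(collapse("it was was fine"));     // it was fine
    }
}
```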

OK, the examples above just show how to use these new constructs; they do not necessarily mean the constructs are "better":-) They are "easier" to understand for me, though.

The method group(String name) is NOT added to the MatchResult interface, for compatibility reasons (personally I don't think it's a big deal; my guess is that the majority of RegEx users can live with just the Matcher class, so compatibility weighs more here. Let me know otherwise).



