Thursday May 26, 2011

A Facelift for the Java Regular Expression Unicode support

As documented in the "Unicode Support" section of the latest JDK6 Pattern API, Java regular expressions support Unicode hex notation, Unicode properties such as general category, block, uppercase and lowercase, simple word boundaries, case mapping, supplementary characters, and so on. Since the beginning (JDK 1.4) the documentation has claimed to be "in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression, plus RL2.1 Canonical Equivalents", which we believed should satisfy most developers' normal requirements for Unicode support. However, early this year Tom Christiansen took some time from his busy Perl 5.14 schedule to perform a comprehensive reality check of Java's Unicode support, focusing on regular expressions. Most of Tom's reports, and the follow-up discussions between Tom and me, can be found in the OpenJDK i18n-dev mailing list archives. To summarize:


  • Though the existing Java Unicode escape sequence for UTF-16, the \uXXXX notation, is arguably already a mechanism for specifying a Unicode code point in a Java regular expression, the "newly" added Perl-style \x{...} is definitely a more convenient construct and better meets the RL1.1 Hex Notation requirement.

  • The predefined POSIX character classes, such as \p{Upper}, \p{Lower}, \p{Space}, \p{Alpha}, etc., are all ASCII-only in j.u.regex, which obviously does not meet the Unicode guideline RL1.2a Compatibility Properties. (By the way, Perl migrated from its early ASCII-only implementation to the Unicode version years ago.)

  • Some of the most frequently used properties (which are also required by the guideline RL1.2 Properties), such as Script, Alphabetic, Uppercase, Lowercase and White_Space (the last three via \p{javaUpperCase}, \p{javaLowerCase}, \p{javaWhitespace}), are either missing or have a "slightly" different interpretation compared to the standard. One of the reasons for this situation is that the existing Java implementation did not have access to the properties defined in PropList.txt. So, for example, Alphabetic is completely missing, and Character.isLowerCase only treats GC=LOWERCASE_LETTER code points as "lowercase" and does not count Other_Lowercase.

  • The behaviors of the predefined character classes \b, \B, \w and \W in Java regex are "broken", to use Tom's words: they break the historical connection between the word characters \w and the word boundaries \b for "no sound reason". \w is defined as ASCII-only ("A word character: [a-zA-Z_0-9]"), while the implementation of \b is "semi-" Unicode-aware. The result, as shown in the sample Tom gave, is that "élève" is NOT matched by the pattern \b\w+\b. So the RL1.4 Simple Word Boundaries requirement is obviously NOT met.

  • There are too many "errors" in the Perl-related wording in the j.u.regex.Pattern doc; it is way overdue for an update.



It's an embarrassing reality. So although JDK7 was at a very late stage of the release cycle, we were convinced that it was worth taking some risk to give the regex Unicode support a facelift and improve the situation. Here is a summary of what we have done so far. We believe that with these changes the Unicode support in regex is now in much better shape.



The support of the new Perl style \x{...} construct


Before JDK7, the Java Unicode escape sequence notation \uXXXX was the only way to specify a Unicode code point in a Java regex. It is neither convenient nor straightforward when you have to deal with supplementary characters, where two consecutive Unicode escape sequences (the surrogate pair) have to be used, for example,

    int codePoint = ...;
    String hexPattern = codePoint <= 0xFFFF
                        ? String.format("\\u%04x", codePoint)
                        : String.format("\\u%04x\\u%04x",
                                 (int)Character.highSurrogate(codePoint),
                                 (int)Character.lowSurrogate(codePoint));

compared to simply doing

    // the backslash must be doubled inside a Java string literal
    String hexPattern = "\\x{" + Integer.toHexString(codePoint) + "}";
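
To make the comparison concrete, here is a minimal sketch that builds a \x{...} pattern for a code point and matches it against the corresponding string. The class name HexNotationDemo and the helper hexPattern are made up for this illustration (they are not part of the JDK), and the sample code point U+1F600 is an arbitrary supplementary character.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class HexNotationDemo {

    // Builds a regex that matches exactly one code point via the \x{...} construct.
    static String hexPattern(int codePoint) {
        return "\\x{" + Integer.toHexString(codePoint) + "}";
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // a supplementary character (above U+FFFF)
        String input = new String(Character.toChars(codePoint));

        // In memory the string is a surrogate pair, but the pattern names one code point.
        System.out.println(input.length());                                // 2 chars
        System.out.println(hexPattern(codePoint));                         // \x{1f600}
        System.out.println(Pattern.matches(hexPattern(codePoint), input)); // true
    }
}
```

The same helper works unchanged for BMP characters, so the caller no longer needs a separate branch for code points above 0xFFFF.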


A new flag UNICODE_CHARACTER_CLASS is added to support Unicode version of POSIX character classes



With this flag set, the ASCII-only predefined character classes (\b \B \w \W \d \D \s \S) and POSIX character classes (\p{alpha}, \p{lower}, \p{upper}...) are flipped to their Unicode versions*, as documented in the latest JDK7 j.u.regex.Pattern API doc. This flag also enables UNICODE_CASE (that is, if the new flag UNICODE_CHARACTER_CLASS is specified, everything goes Unicode). Ideally I would like to just evolve/upgrade Java regex from the aged ASCII-only behavior to Unicode (maybe with an ASCII_ONLY_COMPATIBILITY mode as a fallback), like what Perl did years ago, but given Java's "compatibility" spirit (and the performance concern as well), this is unlikely to be accepted.

* This addresses the various \b\w+\b issues.
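
As a rough illustration of the flag, the sketch below runs Tom's \b\w+\b pattern against "élève" with and without UNICODE_CHARACTER_CLASS. The class and method names are invented for this example; (?U) is the embedded-flag form of UNICODE_CHARACTER_CLASS described in the JDK7 Pattern doc.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class UnicodeClassDemo {
    static final String WORD = "\\b\\w+\\b";

    // Default mode: \w is ASCII-only, so the accented letters cannot be consumed.
    static boolean matchesDefault(String s) {
        return Pattern.compile(WORD).matcher(s).matches();
    }

    // With the flag, \w and \b both follow the Unicode definitions.
    static boolean matchesUnicode(String s) {
        return Pattern.compile(WORD, Pattern.UNICODE_CHARACTER_CLASS).matcher(s).matches();
    }

    public static void main(String[] args) {
        String word = "\u00e9l\u00e8ve"; // "élève"
        System.out.println(matchesDefault(word));                 // false
        System.out.println(matchesUnicode(word));                 // true
        System.out.println(Pattern.matches("(?U)" + WORD, word)); // true, embedded flag
    }
}
```

Plain ASCII words match in both modes, so turning the flag on only widens what \w and \b accept.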


\p{IsBinaryProperty} is introduced to explicitly support Unicode binary properties



  • Alphabetic

  • Ideographic

  • Letter

  • Lowercase

  • Uppercase

  • Titlecase

  • Punctuation

  • Control

  • White_Space

  • Digit

  • Hex_Digit

  • Noncharacter_Code_Point

  • Assigned


The spec and implementation of both j.l.C.isLowerCase() and j.l.C.isUpperCase() have also been updated to match the Unicode Standard definition of "lowercase" (LOWERCASE_LETTER + Other_Lowercase) and "uppercase" (UPPERCASE_LETTER + Other_Uppercase). And we have two more friends, j.l.C.isIdeographic() and j.l.C.isAlphabetic(). All of the above Unicode properties now fully follow the Unicode Standard.
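
The following sketch exercises a few of the binary properties listed above, together with the updated Character methods. The class BinaryPropertyDemo and its hasProperty helper are hypothetical names for this example. The sample code points are my own picks: U+2170 (SMALL ROMAN NUMERAL ONE, general category Nl with Other_Lowercase), U+4E2D (a CJK ideograph) and U+00A0 (NO-BREAK SPACE).

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class BinaryPropertyDemo {

    // Tests a single code point against \p{Is<property>}.
    static boolean hasProperty(String property, int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches("\\p{Is" + property + "}", s);
    }

    public static void main(String[] args) {
        int romanOne = 0x2170; // not a LOWERCASE_LETTER, but has Other_Lowercase
        System.out.println(Character.getType(romanOne) == Character.LETTER_NUMBER); // true
        System.out.println(Character.isLowerCase(romanOne));                        // true
        System.out.println(hasProperty("Lowercase", romanOne));                     // true

        int cjk = 0x4E2D;
        System.out.println(Character.isAlphabetic(cjk) && hasProperty("Alphabetic", cjk));
        System.out.println(Character.isIdeographic(cjk) && hasProperty("Ideographic", cjk));

        // White_Space includes the no-break space; Character.isWhitespace does not.
        int nbsp = 0x00A0;
        System.out.println(hasProperty("White_Space", nbsp)); // true
        System.out.println(Character.isWhitespace(nbsp));     // false
    }
}
```

The last two lines show the kind of "slightly" different interpretation mentioned earlier: the Java-specific notion of whitespace and the Unicode White_Space property disagree on U+00A0.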


Unicode Scripts are now supported


Scripts are specified either with the prefix Is, as in \p{IsHiragana}, or by using the script keyword (or its short form sc), as in \p{script=Hiragana} or \p{sc=Hiragana}. The script names supported by Pattern are the valid script names accepted and defined by UnicodeScript.forName. (Yes, j.l.C.UnicodeScript is also new; I might write a separate blog entry about it.)
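
A small sketch of the three script notations side by side, plus the matching Character.UnicodeScript lookups. The class ScriptDemo and its inScript helper are invented for this example, and the sample characters (HIRAGANA LETTER A, U+3042, and KATAKANA LETTER A, U+30A2) are my own choice.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class ScriptDemo {

    // Returns true if the whole input consists of one character matched by the given class.
    static boolean inScript(String scriptClass, String input) {
        return Pattern.matches(scriptClass, input);
    }

    public static void main(String[] args) {
        String hiraganaA = "\u3042";
        String katakanaA = "\u30a2";

        // The three notations are interchangeable.
        System.out.println(inScript("\\p{IsHiragana}", hiraganaA));     // true
        System.out.println(inScript("\\p{script=Hiragana}", hiraganaA)); // true
        System.out.println(inScript("\\p{sc=Hiragana}", hiraganaA));     // true
        System.out.println(inScript("\\p{IsHiragana}", katakanaA));     // false

        // The same information is available directly from Character.UnicodeScript.
        System.out.println(Character.UnicodeScript.of(0x3042));
        System.out.println(Character.UnicodeScript.forName("Hiragana")
                == Character.UnicodeScript.HIRAGANA);                    // true
    }
}
```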

And yes, we have corrected those errors in the Perl-related documentation, with Tom's help.

Sure, most of the enhancements mentioned here are probably "who cares" for most developers:-) but if you have read this far, I guess you do care, so here is the latest JDK7 j.u.regex API doc.  And...yes, \N{...} and \X are on my to-do list:-)


And a BIG THANKS to Tom Christiansen!!!


About

xuemingshen
