Case-Insensitive Matching in Java RegEx

-Question1: Does Java RegEx support case-insensitive matching?
-Yes, it does.

-Question2: Can Java RegEx do "case-insensitive matching" on Non-ASCII text?
-Sure. Java RegEx supports case-insensitive matching not only for characters in US-ASCII; Unicode case folding has been supported as well from day one.

-Question3: Really? How to do that?
-It's actually fairly easy. Both can be enabled by specifying the corresponding match flag(s), CASE_INSENSITIVE and UNICODE_CASE. For example:

For ASCII-only case-insensitive matching:

    Pattern p = Pattern.compile("YOUR_REGEX", Pattern.CASE_INSENSITIVE);

Or, for Unicode case-folding matching:

    Pattern p = Pattern.compile("YOUR_REGEX", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
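
To make the difference between the two flag combinations concrete, here is a minimal sketch. The class name FlagDemo and its helper methods are made up for illustration; only Pattern, its flags, and Matcher.matches() come from the JDK. The expected results in the comments assume a JDK whose behavior follows the spec.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helpers are illustrative only.
final class FlagDemo {
    // ASCII-only folding: CASE_INSENSITIVE on its own.
    static boolean asciiOnly(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                      .matcher(input).matches();
    }

    // Unicode folding: CASE_INSENSITIVE combined with UNICODE_CASE.
    static boolean unicodeAware(String regex, String input) {
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // U+00E0 is a-grave (lowercase), U+00C0 is A-grave (uppercase).
        System.out.println(asciiOnly("abc", "ABC"));           // true
        System.out.println(asciiOnly("\u00e0", "\u00c0"));     // false
        System.out.println(unicodeAware("\u00e0", "\u00c0"));  // true
    }
}
```

The second call returns false because U+00E0 lies outside US-ASCII, so CASE_INSENSITIVE alone does not fold it.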

For those String.matches(String regex) fans, you can use the special constructs (?i) and (?iu) to "embed" the matching flag(s) in the regex body. For example,

    "XYZxyz".matches("(?i)[a-z]+")

does ASCII case-insensitive matching. Use (?iu) for Unicode case folding.

And here is a bonus: you can also "turn off" the flag(s) by using (?-i).
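
The embedded forms can be tried out the same way. The sketch below is only an illustration: the class name EmbeddedFlagDemo and its constants are invented here, while (?i), (?iu) and (?-i) are the constructs described above. The results in the comments assume a spec-compliant JDK.

```java
// Hypothetical demo class; the constant names are illustrative only.
final class EmbeddedFlagDemo {
    static final String ASCII_FOLD   = "(?i)[a-z]+";     // ASCII-only folding
    static final String NO_UNICODE   = "(?i)\u00e0";     // a-grave, ASCII-only folding
    static final String UNICODE_FOLD = "(?iu)\u00e0";    // a-grave, Unicode folding
    static final String SCOPED       = "(?i)ab(?-i)cd";  // folding on for "ab", off for "cd"

    public static void main(String[] args) {
        System.out.println("XYZxyz".matches(ASCII_FOLD));    // true
        System.out.println("\u00c0".matches(NO_UNICODE));    // false
        System.out.println("\u00c0".matches(UNICODE_FOLD));  // true
        System.out.println("ABcd".matches(SCOPED));          // true
        System.out.println("ABCD".matches(SCOPED));          // false
    }
}
```

The last two lines show that (?-i) only affects the part of the regex that follows it.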

-Question4: Cool. Here is the tough one: does it really work that way? I heard...
-.......OK, if you are using JDK6u2 (or a later update) or JDK7, the answer is "YES". But if you still use 6.0 or an earlier version, yes, we screwed it up "a little" in those releases :-( We fixed it in JDK7 and have already back-ported the change into 6.0u2. So get the latest update now.

For those who want a few more details, here is the story.

The case folding spec in Java Regex clearly says:

CASE_INSENSITIVE: By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag IN CONJUNCTION with this flag.

UNICODE_CASE: When this flag is specified then case-insensitive matching, when enabled by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched.
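
Read literally, the spec text above means that UNICODE_CASE has no effect unless CASE_INSENSITIVE is also set. A small sketch of that reading follows. The class name UnicodeCaseAloneDemo and the helper matchesWith are assumptions made for this example, and the results in the comments are what a spec-compliant JDK should give.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; name and helper are illustrative only.
final class UnicodeCaseAloneDemo {
    static boolean matchesWith(int flags, String regex, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        int both = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

        // UNICODE_CASE alone: matching stays case-sensitive, even for ASCII.
        System.out.println(matchesWith(Pattern.UNICODE_CASE, "abc", "ABC"));        // false
        System.out.println(matchesWith(Pattern.UNICODE_CASE, "\u00e0", "\u00c0"));  // false

        // Both flags together: Unicode-aware case-insensitive matching.
        System.out.println(matchesWith(both, "abc", "ABC"));                        // true
        System.out.println(matchesWith(both, "\u00e0", "\u00c0"));                  // true
    }
}
```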

But our RegEx implementation in previous releases disagreed with our own spec :-(

(1) The flag UNICODE_CASE is mostly treated as "UNICODE_CASE_INSENSITIVE", meaning the matching is always case-insensitive whether or not the CASE_INSENSITIVE flag is enabled. The implementation only "accidentally" follows the spec (matching case-sensitively) in the character class case, if the specified character is in Basic Latin (ASCII) or the Latin-1 Supplement (<=0xff).

(2) When CASE_INSENSITIVE is specified without UNICODE_CASE, some Unicode case-insensitive matching is still being done.

  a) Character Class Single with Latin-1 Supplement input
    regex "[\u00e0]" matches input "\u00c0"

  b) Character Class Range with any Non-ASCII input
    regex "[\u00e0-\u00e5]" matches "\u00c2"
    regex "[\u0430-\u0431]" matches "\u0411"

  c) Back Reference with any Non-ASCII input
    regex "(\u00e0)\\1" matches "\u00e0\u00c0"
    regex "(\u0430)\\1" matches "\u0430\u0410"

(3) The only place we "get it right" is the regex constructs for a single character (for example, "\u00e0" does not match "\u00c0") or a slice of characters (for example, "\u00e0\u00e1" does not match "\u00c0\u00c1"). These do follow the spec and allow only ASCII characters to match case-insensitively when only CASE_INSENSITIVE is present.

The implementation inconsistencies above have been fixed in JDK7 and JDK6u2 to fully follow the specification.
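
If you want to see which side of the fix a given JDK is on, the character class and back reference cases from (2) can be rechecked with a few lines of code. This is just a sketch: the class SpecComplianceCheck and its methods are invented for this example, and the regex/input pairs are copied from the list above.

```java
import java.util.regex.Pattern;

// Hypothetical checker; class and method names are illustrative, not part of the JDK.
final class SpecComplianceCheck {
    // {regex, input} pairs copied from the cases listed under (2).
    static final String[][] CASES = {
        {"[\u00e0]",        "\u00c0"},        // character class, single
        {"[\u00e0-\u00e5]", "\u00c2"},        // character class, Latin-1 range
        {"[\u0430-\u0431]", "\u0411"},        // character class, Cyrillic range
        {"(\u00e0)\\1",     "\u00e0\u00c0"},  // back reference, Latin-1
        {"(\u0430)\\1",     "\u0430\u0410"},  // back reference, Cyrillic
    };

    static boolean matches(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Per the spec: no match with CASE_INSENSITIVE alone,
    // and a match once UNICODE_CASE is added.
    static boolean followsSpec() {
        int both = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        for (String[] c : CASES) {
            if (matches(c[0], c[1], Pattern.CASE_INSENSITIVE)) return false;
            if (!matches(c[0], c[1], both)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(followsSpec() ? "follows the spec" : "deviates from the spec");
    }
}
```

On a release that still has the old behavior, the first check inside the loop should be the one that trips.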

*\u0000-\u007f: Basic Latin (aka ASCII)
*\u0080-\u00ff: Latin-1 Supplement
*\u0400-\u04f9: Cyrillic

About

xuemingshen
