Named Capturing Group in JDK7 RegEx

In a complicated regular expression, which might have many capturing and
non-capturing groups, counting group numbers from left to right can get
confusing, and the expression itself (the groups and their back
references) becomes hard to understand and trace. A natural solution is
to give each group a name instead of counting manually from left to
right, and then back reference it (in the same regex) or access the
captured match result (from MatchResult) by the assigned NAME, as Python,
PHP, .Net and Perl (5.10.0) do in their regex engines. This convenient
feature has been missing from Java RegEx for years; it has finally made
it into JDK7 b50.


The newly added RegEx constructs to support the named capturing group are:

(1) (?<NAME>X) to define a named group "NAME"
(2) \k<NAME> to back reference a named group "NAME"
(3) ${NAME} to reference a captured group in the matcher's replacement string
(4) group(String NAME) to return the input subsequence captured by the given named group
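
The usage example further down does not exercise the back reference (2), so here is a minimal sketch that touches all four constructs in one place. The class name NamedGroupDemo, the group name "word" and the repeated-word pattern are made up for illustration and are not from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    // (1) (?<word>...) defines the group "word";
    // (2) \k<word> requires the same text to appear again.
    static final Pattern REPEATED =
        Pattern.compile("\\b(?<word>\\w+)\\s+\\k<word>\\b");

    // (4) group(String) returns the text captured by the named group,
    // or null here when no repeated word is found.
    static String firstRepeatedWord(String text) {
        Matcher m = REPEATED.matcher(text);
        return m.find() ? m.group("word") : null;
    }

    // (3) ${word} in the replacement string refers to the named group,
    // so each doubled word is replaced by a single copy.
    static String collapseRepeats(String text) {
        return REPEATED.matcher(text).replaceAll("${word}");
    }

    public static void main(String[] args) {
        String text = "this is is a test";
        System.out.println(firstRepeatedWord(text));
        System.out.println(collapseRepeats(text));
    }
}
```

Note that the regex itself uses a single backslash (\k<word>); the doubling only comes from Java string literal escaping.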


With these new constructs, now you can write something like
    String pStr = "0x(?<bytes>\\p{XDigit}{1,4})\\s++u\\+(?<char>\\p{XDigit}{4})(?:\\s++)?";
    Matcher m = Pattern.compile(pStr).matcher(INPUTTEXT);
    if (m.matches()) {
        int bs = Integer.valueOf(m.group("bytes"), 16);
        int c =  Integer.valueOf(m.group("char"), 16);
        System.out.printf("[%x] -> [%04x]%n", bs, c);
    }


    or

    System.out.println("0x1234 u+5678".replaceFirst(pStr, "u+${char} 0x${bytes}"));


OK, the examples above just show how to use these new constructs; they are not necessarily "better" :-) though they are easier for me to understand.

The method group(String name) is NOT added to the MatchResult interface, out of compatibility concerns (personally I don't think it's a big deal; my guess is that the majority of RegEx users can live with the Matcher class alone, so compatibility weighs more here. Let me know otherwise).
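
A named group is still numbered from left to right like any other capturing group, so code that only holds a MatchResult can fall back to the index. The sketch below assumes the JDK 7 API described in this post; the class name MatchResultDemo and the helper snapshot are hypothetical.

```java
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchResultDemo {
    static final Pattern P = Pattern.compile(
        "0x(?<bytes>\\p{XDigit}{1,4})\\s++u\\+(?<char>\\p{XDigit}{4})");

    // Returns { by name, by index, by name, by index } for the two groups,
    // or null if the input does not match.
    static String[] snapshot(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return null;
        }
        // The snapshot is accessed by group number only in this sketch.
        MatchResult r = m.toMatchResult();
        return new String[] {
            m.group("bytes"), r.group(1),   // "bytes" is group 1
            m.group("char"),  r.group(2)    // "char" is group 2
        };
    }

    public static void main(String[] args) {
        for (String s : snapshot("0x1234 u+5678")) {
            System.out.println(s);
        }
    }
}
```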

Comments:

It would be great if we had access to the names used in the regular expression.

Something like:
String ss;
String pStr = "0x(?<bytes>\\p{XDigit}{1,4})\\s++u\\+(?<char>\\p{XDigit}{4})(?:\\s++)?";
Matcher m = Pattern.compile(pStr).matcher(INPUTTEXT);
if (m.matches()) {
    // group 0 is the whole match, so capturing groups run from 1 to groupCount()
    for (int i = 1; i <= m.groupCount(); i++) {
        // getGroupName(int) is the proposed method; Matcher does not have it
        if ((ss = m.getGroupName(i)) == null) ss = "";
        int bs = Integer.valueOf(m.group(i), 16);
        System.out.printf("[#%d %s] -> [%04x]%n", i, ss, bs);
    }
}

Posted by Erich on September 12, 2010 at 04:25 PM PDT #

Or maybe more useful:

m.getGroupNames();
1 = 'name1',
2 = 'name2',
3 = 'name3',

and m.getMatchedNames();
as (only) matched names:
2 = 'name2',

Posted by Erich on September 12, 2010 at 05:30 PM PDT #
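
Until something like getGroupNames() exists, one workaround is to scan the pattern source for (?<NAME> yourself. The following is only a naive sketch: the class GroupNameScanner and both helpers are hypothetical, and the scan will be fooled by an escaped parenthesis, a character class, or a \Q...\E quoted section that happens to contain "(?<name>".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNameScanner {
    // Matches "(?<name>". Lookbehinds "(?<=" and "(?<!" are skipped
    // because a group name has to start with a letter.
    private static final Pattern NAMED_GROUP =
        Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

    // Group names in the order they appear in the regex source.
    static List<String> groupNames(String regex) {
        List<String> names = new ArrayList<String>();
        Matcher m = NAMED_GROUP.matcher(regex);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    // Name -> captured text, only for groups that took part in the match.
    static Map<String, String> namedMatches(String regex, String input) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.matches()) {
            for (String name : groupNames(regex)) {
                String value = m.group(name);
                if (value != null) {
                    result.put(name, value);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String pStr = "0x(?<bytes>\\p{XDigit}{1,4})\\s++u\\+(?<char>\\p{XDigit}{4})(?:\\s++)?";
        System.out.println(groupNames(pStr));
        System.out.println(namedMatches(pStr, "0x1234 u+5678"));
    }
}
```

groupNames plays the role of the suggested getGroupNames(), and namedMatches approximates getMatchedNames() by dropping groups whose value is null.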

The OP's syntax was garbled: too many backslashes, and the back references were using $< ... > instead of ${ ... }. I found the curly braces solution here: http://stackoverflow.com/questions/415580/regex-named-groups-in-java#answer-415635

String pStr = "0x(?<bytes>\\p{XDigit}{1,4})\\s+u\\+(?<char>\\p{XDigit}{4})";
System.out.println("0x1234 u+5678".replaceFirst(pStr, "u+${char} 0x${bytes}"));

Still wondering why it's not mentioned in the official javadoc
http://download.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#groupname

Posted by Bernhard Wagner on July 20, 2011 at 02:46 AM PDT #

About

xuemingshen
