Iterating over the codepoints of a String

Recently I wanted to iterate over the code points of a String instead of its char values. Unicode 3.1 added supplementary characters, bringing the total number of characters to more than the 2^16 (65,536) that can be distinguished by a single 16-bit char value. Therefore, a char value no longer has a one-to-one mapping to the fundamental semantic unit in Unicode. JDK 5 was updated to support the larger set of character values. Instead of changing the definition of the char type, each supplementary character is represented by a surrogate pair of two char values. To reduce naming confusion, the term code point is used to refer to the number that represents a particular Unicode character, including supplementary ones.

Therefore, a sequence of char values can be thought of as a variable-length encoding of a sequence of code points; the older characters (in the Basic Multilingual Plane) are represented by a single char value while the newer supplementary characters take two char values. The definition of language concepts, like identifiers, was rephrased in terms of code points instead of chars. The existing isFoo(char) methods in the Character class were augmented with isFoo(int) overload siblings using an int to store a code point value.
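To make the variable-length encoding concrete, here is a small sketch. The class name SurrogatePairDemo and the choice of U+10400 (DESERET CAPITAL LETTER LONG I) as the sample supplementary character are my own, picked only for illustration. It compares the char count with the code point count of a two-character string, and the isLetter(char) method with its isLetter(int) sibling.

```java
public class SurrogatePairDemo {
    // One BMP letter followed by one supplementary letter, U+10400.
    // Character.toChars returns the surrogate pair for a supplementary code point.
    static final String SAMPLE = "A" + new String(Character.toChars(0x10400));

    public static void main(String[] args) {
        // Three char values: 'A' plus a high and a low surrogate.
        System.out.println(SAMPLE.length());                           // prints 3
        // Only two code points.
        System.out.println(SAMPLE.codePointCount(0, SAMPLE.length())); // prints 2
        // A lone high surrogate is not a letter...
        System.out.println(Character.isLetter(SAMPLE.charAt(1)));      // prints false
        // ...but the code point it helps encode is.
        System.out.println(Character.isLetter(SAMPLE.codePointAt(1))); // prints true
    }
}
```

The last two lines are the reason the int overloads exist: asking a question about half of a surrogate pair gives an answer about the surrogate, not about the character.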

Previously, one way to iterate through the character values of a String was to look at each char value in turn:

String s = ...

for(int i = 0; i < s.length(); i++) {
    char c = s.charAt(i);
    // Process c...
}

Now, the chars need to be considered as possible members of a surrogate pair representing a single code point. Currently the canonical loop for this operation is:

String s = ...

for(int cp, i = 0; i < s.length(); i += Character.charCount(cp)) {
    cp = s.codePointAt(i);
    // Process cp...
}

At present, there is no direct API support for getting an iterator of the code point values from a String or CharSequence; perhaps one will be added in the future.
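For readers on Java 8 or later: that release added CharSequence.codePoints(), which returns an IntStream of code point values. The sketch below assumes such a runtime; the class and method names (CodePointsStreamDemo, countSupplementary, sumCodePoints) are hypothetical and chosen only for this example.

```java
import java.util.PrimitiveIterator;

public class CodePointsStreamDemo {
    // Counts the code points that lie outside the Basic Multilingual Plane.
    static long countSupplementary(CharSequence cs) {
        return cs.codePoints()
                 .filter(Character::isSupplementaryCodePoint)
                 .count();
    }

    // Sums the code point values using a primitive iterator,
    // so no Integer objects are created along the way.
    static long sumCodePoints(CharSequence cs) {
        long sum = 0;
        for (PrimitiveIterator.OfInt it = cs.codePoints().iterator(); it.hasNext(); ) {
            sum += it.nextInt();
        }
        return sum;
    }

    public static void main(String[] args) {
        String s = "a\uD801\uDC00b"; // 'a', U+10400, 'b'
        System.out.println(countSupplementary(s)); // prints 1
        System.out.println(sumCodePoints("ab"));   // prints 195 (97 + 98)
    }
}
```

On older runtimes the explicit codePointAt/charCount loop remains the way to go.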

Glossary of Unicode terms

Comments:

How about this:
import static java.lang.Character.charCount;
import static java.lang.Character.codePointAt;

import java.util.Iterator;
import java.util.NoSuchElementException;

public class CodePointIterator implements Iterator<Integer> {
	private final char[] chars;
	private int next; // index of the first char of the next code point
	public CodePointIterator(CharSequence cs) {
		// Copy the chars so later mutation of cs cannot affect the iteration.
		chars = cs instanceof String
				? ((String) cs).toCharArray()
				: cs.toString().toCharArray();
	}
	public boolean hasNext() {
		return next < chars.length;
	}
	public Integer next() {
		// >= rather than >: next == chars.length means the input is exhausted.
		if (next >= chars.length) throw new NoSuchElementException();
		int nc = codePointAt(chars, next);
		next += charCount(nc); // advance by one or two chars
		return nc;
	}
	public void remove() {
		throw new UnsupportedOperationException();
	}
}

or something similar. But there's a pretty heavy autoboxing overhead if your strings are large.

Posted by Ian Phillips on September 19, 2006 at 06:00 AM PDT #

Ian, yes, your code is similar to the code proposed in bug 5003547, "add support for iterating over the codepoints in a string."

Extracting a String from the CharSequence is good to avoid issues with the argument CharSequence being mutated as it is iterated over. However, the instanceof check for String is unnecessary since String.toString() just returns this so the method is very fast.

If the characters of the character sequence are mostly in the ASCII range, then the autoboxing semantics will require cached Integer objects to be returned for those values, so boxing them does not allocate a new object each time.

Posted by Joe Darcy on September 20, 2006 at 05:20 AM PDT #
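The point about cached Integer objects can be checked with a short sketch. The class name BoxingCacheDemo and the helper sameBoxedInstance are hypothetical. The language specification guarantees that boxing an int in the range -128 to 127 yields the same object each time; for larger values, such as supplementary code points, the result depends on the JVM and its settings, so the example does not state what it prints for those.

```java
public class BoxingCacheDemo {
    // Boxes the same value twice and reports whether both boxes are one object.
    static boolean sameBoxedInstance(int codePoint) {
        Integer a = codePoint; // autoboxing goes through Integer.valueOf
        Integer b = codePoint;
        return a == b;         // identity comparison on purpose, not equals()
    }

    public static void main(String[] args) {
        // ASCII code points fall inside the guaranteed cache range.
        System.out.println(sameBoxedInstance('A')); // prints true
        // Outside the range the answer is not guaranteed either way.
        System.out.println(sameBoxedInstance(0x10400));
    }
}
```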
