Musings on JDK development

  • Java
    September 18, 2006

Iterating over the codepoints of a String

Recently I wanted to iterate over the code points of a String instead of its char values. Unicode 3.1 added supplementary characters, bringing the total number of characters to more than the 2^16 (65,536) characters that can be distinguished by a single 16-bit char value. Therefore, a char value no longer has a one-to-one mapping to the fundamental semantic unit in Unicode. JDK 5 was updated to support the larger set of character values. Instead of changing the definition of the char type, each new supplementary character is represented by a surrogate pair of two char values. To reduce naming confusion, the term code point is used to refer to the number that represents a particular Unicode character, including supplementary ones.

Therefore, a sequence of char values can be thought of as a variable-length encoding of a sequence of code points; the older characters (in the Basic Multilingual Plane) are represented by a single char value while the newer supplementary characters take two char values. The definition of language concepts, like identifiers, was rephrased in terms of code points instead of chars. The existing isFoo(char) methods in the Character class were augmented with isFoo(int) overload siblings using an int to store a code point value.
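
As a minimal sketch of this variable-length encoding, the snippet below builds one-code-point strings and compares their char length with their code point count. The class name SurrogatePairDemo, the helper fromCodePoint, and the choice of U+1D11E and U+10400 as sample supplementary characters are assumptions made for illustration; only the java.lang.Character and String methods come from the JDK.

```java
// Illustrative sketch only; the class and helper names are hypothetical.
final class SurrogatePairDemo {
    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
    static final int G_CLEF = 0x1D11E;

    /** Builds a String that holds exactly one code point. */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint('A');
        String supplementary = fromCodePoint(G_CLEF);

        // A BMP character occupies one char; a supplementary one occupies two.
        System.out.println(bmp.length());
        System.out.println(supplementary.length());

        // Counting code points instead of chars gives one in both cases.
        System.out.println(supplementary.codePointCount(0, supplementary.length()));

        // The two chars form a surrogate pair.
        System.out.println(Character.isSurrogatePair(
                supplementary.charAt(0), supplementary.charAt(1)));

        // The int overload classifies the whole code point
        // (U+10400 DESERET CAPITAL LETTER LONG I is a letter).
        System.out.println(Character.isLetter(0x10400));
    }
}
```

Passing a lone surrogate char to the char overload, for example Character.isLetter(supplementary.charAt(0)), cannot see the full character, which is why the int overloads were needed.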

Previously, one way to iterate through the character values of a String was to look at each char value in turn:

    String s = ...
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        // Process c...
    }

Now, the chars need to be considered as possible members of a surrogate pair representing a single code point. Currently the canonical loop for this operation is:

    String s = ...
    for (int cp, i = 0; i < s.length(); i += Character.charCount(cp)) {
        cp = s.codePointAt(i);
        // Process cp...
    }

At present, there is no direct API support for getting an iterator of the code point values from a String or CharSequence; perhaps one will be added in the future.
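
To make the difference between the two loops concrete, here is a small self-contained sketch that runs both over the same input and collects what each one sees. The class CodePointLoopDemo, its two helper methods, and the sample string are hypothetical names chosen for this illustration; the loop bodies follow the loops shown above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; the class and method names are hypothetical.
final class CodePointLoopDemo {
    /** Collects the raw char values, one entry per char. */
    static List<Integer> charValuesOf(String s) {
        List<Integer> result = new ArrayList<Integer>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            result.add((int) c);
        }
        return result;
    }

    /** Collects the code points, one entry per Unicode character. */
    static List<Integer> codePointsOf(String s) {
        List<Integer> result = new ArrayList<Integer>();
        // Advance by one or two chars depending on the code point just read.
        for (int cp, i = 0; i < s.length(); i += Character.charCount(cp)) {
            cp = s.codePointAt(i);
            result.add(cp);
        }
        return result;
    }

    public static void main(String[] args) {
        // 'a', then U+1D11E written as its surrogate pair, then 'b'.
        String s = "a\uD834\uDD1Eb";
        System.out.println(charValuesOf(s).size());
        System.out.println(codePointsOf(s).size());
    }
}
```

For this input the char loop visits four values, including the two halves of the surrogate pair separately, while the code point loop visits three.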

Glossary of Unicode terms

Comments ( 2 )
  • Ian Phillips Tuesday, September 19, 2006
    How about this:
    import static java.lang.Character.charCount;
    import static java.lang.Character.codePointAt;

    import java.util.Iterator;
    import java.util.NoSuchElementException;

    public class CodePointIterator implements Iterator<Integer> {
        private final char[] chars;
        private int next; // index of the next char to read

        public CodePointIterator(CharSequence cs) {
            // Copy the chars so the iteration works on a private snapshot.
            chars = cs instanceof String
                ? ((String) cs).toCharArray()
                : cs.toString().toCharArray();
        }

        public boolean hasNext() {
            return next < chars.length;
        }

        public Integer next() {
            // >= rather than >, so an exhausted iterator throws as the contract requires.
            if (next >= chars.length) throw new NoSuchElementException();
            int nc = codePointAt(chars, next);
            next += charCount(nc);
            return nc; // autoboxed to Integer
        }

        public void remove() {
            throw new UnsupportedOperationException();
        }
    }

    or something similar. But there's a pretty heavy autoboxing overhead if your strings are large.
  • Joe Darcy Wednesday, September 20, 2006
    Ian, yes, your code is similar to the code proposed in bug 5003547, "add support for iterating over the codepoints in a string."

    Extracting a String from the CharSequence is good to avoid issues with the argument CharSequence being mutated as it is iterated over. However, the instanceof check for String is unnecessary: String.toString() just returns this, so the call is very fast.

    If the characters of the character sequence are mostly in the ASCII range, then the autoboxing semantics will require cached Integer objects to be returned (values from -128 to 127 are cached), so boxing those code points does not allocate new objects.
