The Overhaul of Java UTF-8 Charset

The UTF-8 charset implementation (in all JDK/JRE releases from Sun) has been updated recently to reject non-shortest-form UTF-8 byte sequences, since the old implementation might be leveraged in security attacks. Since then I have been asked many times about what this "non-shortest-form" issue is and what the possible impact might be, so here are the answers.

The first question usually goes "what is the non-shortest-form issue"? The detailed and official answer can be found at Unicode Corrigendum #1: UTF-8 Shortest Form. Put it in simple words, the problem is that the Unicode characters could be represented in more than one way (form) in the "UTF-8 encoding" that many people think/believe it is. When be asked what the UTF-8 encoding looks like, the easy/simple/brief explain would be the bit pattern showed below

# Bits Bit pattern
1 7 0xxxxxxx
2 11 110xxxxx 10xxxxxx
3 16 1110xxxx 10xxxxxx 10xxxxxx
4 21 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

It's close, but it's actually WRONG, based on the latest definition/spec of UTF-8. The pattern above has a loophole that you can actually have more than one form to represent a Unicode character. For example, for ASCII characters from u+0000 to u+007f, the UTF-8 encoding form maintains transparency for all of them, so they keep their ASCII code values of 0x00..0x7f (in one-byte form) in UTF-8. However if based on the above pattern, These characters can also be represented in 2-bytes form as [c0, 80]..[c1, bf], the "non-shortest-form". The code below shows all of these non-shortest-2-bytes-form for these ASCII characters, if you run code against the "OLD" version of JDK/JRE.

byte[] bb = new byte[2];
for (int b1 = 0xc0; b1 < 0xc2; b1++) {
for (int b2 = 0x80; b2 < 0xc0; b2++) {
bb[0] = (byte)b1;
bb[1] = (byte)b2;
String cstr = new String(bb, "UTF8");
char c = cstr.toCharArray()[0];
System.out.printf("[%02x, %02x] -> U+%04x [%s]%n",
b1, b2, c & 0xffff, (c>=0x20)?cstr:"ctrl");

The output would be

[c0, a0] -> U+0020 [ ]
[c0, a1] -> U+0021 [!]
[c0, b6] -> U+0036 [6]
[c0, b7] -> U+0037 [7]
[c0, b8] -> U+0038 [8]
[c0, b9] -> U+0039 [9]
[c1, 80] -> U+0040 [@]
[c1, 81] -> U+0041 [A]
[c1, 82] -> U+0042 [B]
[c1, 83] -> U+0043 [C]
[c1, 84] -> U+0044 [D]

so for a string like "ABC" you would have two forms of UTF-8 sequences

"0x41 0x42 0x43" and "0xc1 0x81 0xc1 0x82 0xc1 0x83"

The Unicode Corrigendum #1: UTF-8 Shortest Form specifies explicitly that

"The definition of each UTF specifies the illegal code unit sequences in that UTF. For example, the definition of UTF-8 (D36) specifies that code unit sequences such as [C0, AF] are ILLEGAL."

Our old implementation accepts those non-shortest-form (while never generates them when encoding). The new UTF_8 charset now rejects the non-shortest-form byte sequences for all BMP characters, only the "legal byte sequences" listed below are accepted.

/\* Legal UTF-8 Byte Sequences
\* # Code Points Bits Bit/Byte pattern
\* 1 7 0xxxxxxx
\* U+0000..U+007F 00..7F
\* 2 11 110xxxxx 10xxxxxx
\* U+0080..U+07FF C2..DF 80..BF
\* 3 16 1110xxxx 10xxxxxx 10xxxxxx
\* U+0800..U+0FFF E0 A0..BF 80..BF
\* U+1000..U+FFFF E1..EF 80..BF 80..BF
\* 4 21 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
\* U+10000..U+3FFFF F0 90..BF 80..BF 80..BF
\* U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
\* U+100000..U10FFFF F4 80..8F 80..BF 80..BF

The next question would be "What would be the issue/problem if we keep using the old version of JDK/JRE?".

First, I'm not a lawyer...oops, I meant I'm not a security expert:-) so my word does not count. So we consulted with our security experts' opinion. Their conclusion is "it is NOT a security vulnerability in Java SE per se, but it may be leveraged to attack systems running software that relies on the UTF-8 charset to reject these non-shortest form of UTF-8 sequences".

A simple scenario that might give you an idea about what the above "may be leveraged to attack..." really means is

(1)A Java application would like to filter the incoming UTF-8 input stream to reject certain key words, for example "ABC"

(2)Instead of decoding the input UTF-8 byte sequences into Java char representation and then filter out the keyword string "ABC" at Java "char" level, for example,

String utfStr = new String(bytes, "UTF-8");
if ("ABC".equals(strUTF)) { ... }

The application might choose to filter the raw UTF-8 byte sequences "0x41 0x42 0x43" (only) directly against the UTF-8 byte input stream and then rely on (assume) the Java UTF-8 charset to reject any other non-shortest-form of the target keyword, if there is any.

(3)The consequence is the non-shortest form input "0xc1 0x81 0xc1 0x82 0xc1 0x83" will penetrate the filter and trigger a possible security vulnerability, if the underlying JDK/JRE runtime is an OLD version.

So the recommendation is to update to the latest JDK/JRE releases to avoid the potential risk.

Wait, there is also a big bonus for the updating. The UTF-8 charset implementation has not been updated/touched for years, given the fact that the UTF-8 encoding is so widely used (the default encoding for the XML and more and more websites use UTF-8 as their page encoding), we have been taking the "defensive" position of "don't change it if it works" the past years. So Martin and I decided to take this as an opportunity to give it a speed boost as well. The data below is taken from one of my benchmark (this is NOT an official benchmark, provided to give a rough idea of the performance boost) run data which compares the decoding/encoding operations of new implementation and old implementation under -server vm. The new implementation is much faster, especially when de/encoding the single bytes (those ASCIIs). The new decoding and encoding are faster under -client vm as well, but the gap is not as big as in -server vm,I wanted to show you the best:-)

Method Millis Millis(OLD)
Decoding 1b UTF-8 : 1786 12689
Decoding 2b UTF-8 : 21061 30769
Decoding 3b UTF-8 : 23412 44256
Decoding 4b UTF-8 : 30732 35909
Decoding 1b (direct)UTF-8 : 16015 22352
Decoding 2b (direct)UTF-8 : 63813 82686
Decoding 3b (direct)UTF-8 : 89999 111579
Decoding 4b (direct)UTF-8 : 73126 60366
Encoding 1b UTF-8 : 2528 12713
Encoding 2b UTF-8 : 14372 33246
Encoding 3b UTF-8 : 25734 26000
Encoding 4b UTF-8 : 23293 31629
Encoding 1b (direct)UTF-8 : 18776 19883
Encoding 2b (direct)UTF-8 : 50309 59327
Encoding 3b (direct)UTF-8 : 77006 74286
Encoding 4b (direct)UTF-8 : 61626 66517

The new UTF-8 charset implementation has been integrated in
JDK7, JDK6-open, JDK6-u11 and later, JDK5.0u17 and 1.4.2_19.

And if you are interested in what the change looks like, you can take a peek on the webrev of new for OpenJDK7.


Very interesting, having this change right in the JDK is really nice, since it automatically protects the entire stack, even the parts we might not have source for.

Posted by Rob McDermid on March 11, 2009 at 12:01 AM PDT #


please add your site at where other people can find you among the best sites on the internet!


Posted by Kris on March 11, 2009 at 02:36 AM PDT #

In your blog post you say: "The new UTF_8 charset now rejects the non-shortest-form byte sequences for all BMP characters, only the legal byte sequences listed below are accepted."

What happens when the non-shortest-form byte sequences are encountered? Does it throw an exception? Are they ignored? If they are ignored, do you end up with a valid UTF-8 String that contains all the legal characters but omits the illegal ones?

Posted by Mike Wertheim on March 13, 2009 at 08:14 AM PDT #

Here is a more succinct version of my question. Suppose the incoming input stream is "abcdefghi", where the "abc" and "ghi" are correctly encoded characters, but the "def" consists of incorrectly encoded characters. What does the resulting String look like? Do I get an exception? Do I just get "abcghi"? Or do I get something nonsensical (like "abc###ghi", where "###" is the result of mis-parsing the data)?

Posted by Mike Wertheim on March 14, 2009 at 02:02 AM PDT #

Mark, the "illegal byte sequences" are replaced by the default
replacement character u+fffd if new String(bytes, "UTF-8") is used to
decode the bytes, no exception is thrown for "illegal byte sequences"
(or malformed input, in java.nio.charset term). You can have more
"control" (ignore, exception or replace) on how to deal with those
malformed input over the decoding process via the APIs in java.nio.charset.


public String(byte[] bytes, Charset charset)

Constructs a new String by decoding the specified array of bytes using the
specified charset. The length of the new String is a function of the charset,
and hence may not be equal to the length of the byte array.

This method always replaces malformed-input and unmappable-character
sequences with this charset's default replacement string. The CharsetDecoder
class should be used when more control over the decoding process is required.

Posted by Xueming Shen on March 15, 2009 at 03:50 PM PDT #

What about the Java UTF-8 BOM issue? it is such a shameful bug and marked as wont fix.

Posted by ahmetaa on April 17, 2009 at 11:44 PM PDT #

Post a Comment:
  • HTML Syntax: NOT allowed



« April 2014