Overhauling the Java UTF-8 charset

By Xueming Shen

The UTF-8 charset implementation, which is available in all JDK/JRE releases from Sun, has recently been updated to reject non-shortest-form UTF-8 byte sequences, because the old implementation could be leveraged in security attacks. Since then I have been asked many times about what this "non-shortest-form" issue is and what the possible impact might be, so here are some answers.

The first question usually goes: "What is the non-shortest-form issue"?

The detailed and official answer is at Unicode Corrigendum #1: UTF-8 Shortest Form. Simply put, the problem is that a Unicode character can be represented in more than one form under what many people think of as "the UTF-8 encoding". When asked what UTF-8 encoding looks like, the simplest explanation is usually the following bit pattern:

    #   Bits   Bit pattern
    1    7     0xxxxxxx
    2   11     110xxxxx 10xxxxxx
    3   16     1110xxxx 10xxxxxx 10xxxxxx
    4   21     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

That pattern is close, but by the latest definition of UTF-8 it is actually wrong: it leaves a loophole in which a single Unicode character can be represented by more than one form.

For the ASCII characters from u+0000 to u+007f, for example, the UTF-8 encoding form maintains transparency, so they keep their ASCII code values of 0x00..0x7f (the one-byte form) in UTF-8. Under the preceding pattern, however, these characters could also be represented in a two-byte form as [c0, 80]..[c1, bf], the "non-shortest form".

The following code prints every two-byte non-shortest form of these ASCII characters when run against an "old" version of the JDK/JRE (Java Runtime Environment).


    public class NonShortestDemo {
        public static void main(String[] args) throws Exception {
            byte[] bb = new byte[2];
            // [c0..c1][80..bf] covers every two-byte non-shortest form of ASCII.
            for (int b1 = 0xc0; b1 < 0xc2; b1++) {
                for (int b2 = 0x80; b2 < 0xc0; b2++) {
                    bb[0] = (byte) b1;
                    bb[1] = (byte) b2;
                    String cstr = new String(bb, "UTF8");
                    char c = cstr.charAt(0);
                    System.out.printf("[%02x, %02x] -> U+%04x [%s]%n",
                                      b1, b2, c & 0xffff, (c >= 0x20) ? cstr : "ctrl");
                }
            }
        }
    }

The output would be as follows:


...
[c0, a0] -> U+0020 [ ]
[c0, a1] -> U+0021 [!]
...
[c0, b6] -> U+0036 [6]
[c0, b7] -> U+0037 [7]
[c0, b8] -> U+0038 [8]
[c0, b9] -> U+0039 [9]
...
[c1, 80] -> U+0040 [@]
[c1, 81] -> U+0041 [A]
[c1, 82] -> U+0042 [B]
[c1, 83] -> U+0043 [C]
[c1, 84] -> U+0044 [D]
...

So, for a string like "ABC", an old decoder would accept two UTF-8 byte sequences:


"0x41 0x42 0x43" and "0xc1 0x81 0xc1 0x82 0xc1 0x83"

The Unicode Corrigendum #1: UTF-8 Shortest Form specifies explicitly that "The definition of each UTF specifies the illegal code unit sequences in that UTF. For example, the definition of UTF-8 (D36) specifies that code unit sequences such as [C0, AF] are illegal."

Our old implementation accepted those non-shortest forms (though it never generated them when encoding). The new UTF_8 charset now rejects non-shortest-form byte sequences for all BMP characters; only the "legal byte sequences" listed below are accepted.


    /*  Legal UTF-8 Byte Sequences
     *
     * #    Code Points      Bits   Bit/Byte pattern
     * 1                     7      0xxxxxxx
     *      U+0000..U+007F          00..7F
     * 2                     11     110xxxxx    10xxxxxx
     *      U+0080..U+07FF          C2..DF      80..BF
     * 3                     16     1110xxxx    10xxxxxx    10xxxxxx
     *      U+0800..U+0FFF          E0          A0..BF      80..BF
     *      U+1000..U+FFFF          E1..EF      80..BF      80..BF
     * 4                     21     11110xxx    10xxxxxx    10xxxxxx    10xxxxxx
     *     U+10000..U+3FFFF         F0          90..BF      80..BF      80..BF
     *     U+40000..U+FFFFF         F1..F3      80..BF      80..BF      80..BF
     *    U+100000..U+10FFFF        F4          80..8F      80..BF      80..BF
     */
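The new behavior is easy to observe through the java.nio CharsetDecoder API. The sketch below (class and helper names are mine, not from the JDK sources) decodes with CodingErrorAction.REPORT, which throws on malformed input instead of silently substituting U+FFFD the way `new String(bytes, "UTF-8")` does:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class ShortestFormCheck {
    // Decode with REPORT so malformed input raises an exception instead of
    // being replaced with U+FFFD.
    static boolean isRejected(byte[] bytes) {
        CharsetDecoder dec = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            dec.decode(ByteBuffer.wrap(bytes));
            return false;
        } catch (CharacterCodingException cce) {
            return true;
        }
    }

    public static void main(String[] args) {
        // [C0, AF] is the non-shortest form of '/'; a fixed decoder reports it.
        System.out.println(isRejected(new byte[] { (byte) 0xc0, (byte) 0xaf }));
        // The shortest form of '/' still decodes normally.
        System.out.println(isRejected(new byte[] { 0x2f }));
    }
}
```

On a fixed JDK the first call prints true and the second false; an old decoder would have quietly accepted both sequences.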

The next question usually is: "What would be the issue/problem if we keep using the old version of JDK/JRE?"

First, I'm not a lawyer — oops, I meant I'm not a security expert:-) — so my word does not count. We consulted with our security experts instead. Their conclusion is that while "it is not a security vulnerability in Java SE per se, it may be leveraged to attack systems running software that relies on the UTF-8 charset to reject these non-shortest form of UTF-8 sequences".

A simple scenario that might give you an idea about what the above "may be leveraged to attack..." really means:

  1. A Java application wants to filter an incoming UTF-8 input stream to reject certain keywords, for example "ABC".
  2. Instead of decoding the input UTF-8 byte sequence into Java char representation and then filtering out the keyword string "ABC" at the Java "char" level, for example:
    
           String utfStr = new String(bytes, "UTF-8");
           if ("ABC".equals(utfStr)) { ... }
    
     the application might choose to filter the raw UTF-8 byte sequence "0x41 0x42 0x43" (only) directly against the UTF-8 byte input stream, relying on (assuming) the Java UTF-8 charset to reject any non-shortest form of the target keyword, if there is any.
  3. The consequence: the non-shortest-form input "0xc1 0x81 0xc1 0x82 0xc1 0x83" will penetrate the filter and trigger a possible security vulnerability if the underlying JDK/JRE runtime is an OLD version.
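The safe version of the filter in step 2 can be sketched as follows (a minimal illustration with names of my own choosing, not code from any real application): decode first, then match at the char level, so that every byte form of the keyword the decoder accepts collapses to the same String before the comparison.

```java
import java.io.UnsupportedEncodingException;

public class KeywordFilter {
    // Decode first, then filter: any byte-level variant of the keyword that
    // the decoder accepts ends up as the same sequence of chars.
    static boolean containsKeyword(byte[] input, String keyword)
            throws UnsupportedEncodingException {
        String decoded = new String(input, "UTF-8");
        return decoded.contains(keyword);
    }

    public static void main(String[] args) throws Exception {
        byte[] shortest = { 0x41, 0x42, 0x43 };               // "ABC"
        System.out.println(containsKeyword(shortest, "ABC")); // true
    }
}
```

A byte-level pre-filter that only looks for "0x41 0x42 0x43" skips that decoding step, which is exactly the assumption the non-shortest form exploits on an old runtime.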

So the recommendation is: Update to the latest JDK/JRE releases to avoid the potential risk.

Wait, there is another big bonus for updating: performance.

The UTF-8 charset implementation had not been updated or touched for years. UTF-8 is very widely used as the default encoding for XML, and more and more websites use it as their page encoding. Given that, we had taken the defensive position of "don't change it if it works" over the past years.

So Martin and I decided to take this opportunity to give it a speed boost as well. The data below comes from one of my benchmark runs, comparing the decoding/encoding operations of the new and old implementations under the -server VM. (This is not an official benchmark; it is provided only to give a rough idea of the performance gain.)

The new implementation is much faster, especially when decoding or encoding single-byte (ASCII) characters. Decoding and encoding are faster under the -client VM as well, but the gap is not as big as with -server. I wanted to show you the best :-)


    Method                      Millis  Millis(OLD)
    Decoding 1b UTF-8         :  1786  12689
    Decoding 2b UTF-8         : 21061  30769
    Decoding 3b UTF-8         : 23412  44256
    Decoding 4b UTF-8         : 30732  35909
    Decoding 1b (direct)UTF-8 : 16015  22352
    Decoding 2b (direct)UTF-8 : 63813  82686
    Decoding 3b (direct)UTF-8 : 89999 111579
    Decoding 4b (direct)UTF-8 : 73126  60366
    Encoding 1b UTF-8         :  2528  12713
    Encoding 2b UTF-8         : 14372  33246
    Encoding 3b UTF-8         : 25734  26000
    Encoding 4b UTF-8         : 23293  31629
    Encoding 1b (direct)UTF-8 : 18776  19883
    Encoding 2b (direct)UTF-8 : 50309  59327
    Encoding 3b (direct)UTF-8 : 77006  74286
    Encoding 4b (direct)UTF-8 : 61626  66517
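For the curious, here is not the harness that produced the numbers above but a minimal sketch (class and method names are mine) of how one row of such a decode timing can be taken, with a warm-up pass so the JIT has compiled the loop before measurement:

```java
public class Utf8DecodeBench {
    // Decode the same buffer `reps` times and return the elapsed milliseconds.
    static long timeDecode(byte[] data, int reps) throws Exception {
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            new String(data, "UTF-8");
        }
        return (System.nanoTime() - start) / 1000000L;
    }

    public static void main(String[] args) throws Exception {
        // One-byte (ASCII) input; fill with 2b/3b/4b sequences instead to
        // probe the other rows of the table.
        byte[] ascii = new byte[1024 * 1024];
        java.util.Arrays.fill(ascii, (byte) 'A');
        timeDecode(ascii, 10); // warm-up
        System.out.println("Decoding 1b UTF-8: " + timeDecode(ascii, 100) + " ms");
    }
}
```

The direct-buffer rows of the table would instead feed the decoder a direct java.nio.ByteBuffer through the CharsetDecoder API rather than the String constructor.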

The new UTF-8 charset implementation has been integrated in JDK 7, OpenJDK 6, JDK 6 update 11 and later, JDK 5.0u17, and 1.4.2_19.

If you are interested in what the change looks like, you can take a peek at the webrev of the new UTF_8.java for OpenJDK7.

Xueming Shen is an engineer at Sun Microsystems, working in the Java core technologies group.

Comments:

This posting does not address another non-shortest (and incorrect) form of UTF-8, which is the use of UTF-8 to encode each UTF-16 half of a surrogate pair instead of the correct use of UTF-8 to encode the supplementary-plane character the surrogate pair represents.

Trivial example: representing UTF-32 0x00010000 as 0xEDA080EDB080 (encoding UTF-16 0xD800 and 0xDC00 as two separate characters), when the correct UTF-8 encoding is 0xF0908080.

Posted by John W Kennedy on March 16, 2009 at 07:28 AM PDT #

John,

The Unicode Standard explicitly specifies that only those "non-shortest forms" of BMP characters are "illegal" and must be rejected by a conformant implementation. The "non-shortest forms" of supplementary characters are still "legal" UTF-8 byte sequences according to the standard. We never generate those forms, but, mostly out of compatibility concern, we have to accept them as long as they are "legal".

You can find the details in Unicode Corrigendum #1: UTF-8 Shortest Form, quoted above, especially:

(1) "...To address this issue, the Unicode Technical Committee has modified the definition of UTF-8 to forbid conformant implementations from interpreting non-shortest forms for BMP characters, and clarified some of the conformance clauses."

(2) Table 3.1B. Legal UTF-8 Byte Sequences


Is there a typo in this sentence:

"however, these characters can also be represented in 2-bytes form as [c0, 01]..[c1, bf], the "non-shortest-form".

I think this should be [c0, 10], not [c0, 01]. Otherwise the two-byte sequence doesn't fit the mask pattern.

Posted by David Smith on April 29, 2009 at 09:36 PM PDT #

Actually I got it wrong too, it should I think be [c0, 80]

Posted by David Smith on April 29, 2009 at 09:39 PM PDT #

David, you're absolutely right on this; it is a typo, the range should be [c0, 80]..[c1, bf]. Thanks!

Posted by Xueming Shen on April 30, 2009 at 03:47 AM PDT #

The typo has been fixed.

Posted by Christine Dorffi on April 30, 2009 at 04:56 AM PDT #

