Wednesday Aug 10, 2011

Is ZipInput/OutputStream handling the data descriptor wrong for big entries?

A bug report came in recently suggesting that the new ZIP64 support in JDK7 might not be implemented correctly. The submitter quoted the "data descriptor" section of the ZIP spec

      When compressing files, compressed and uncompressed sizes
      should be stored in ZIP64 format (as 8 byte values) when a
      files size exceeds 0xFFFFFFFF.   However ZIP64 format may be
      used regardless of the size of a file.  When extracting, if
      the zip64 extended information extra field is present for
      the file the compressed and uncompressed sizes will be 8
      byte values.

and suggested: "This means the sizes are eight bytes if there is a ZIP64 extended information extra field and four bytes if there is none. This is not what is implemented; ZipOutputStream#writeEXT writes eight-byte data if one of the sizes exceeds 0xFFFFFFFF, but it never writes any ZIP64 extended information extra field. This means conforming implementations will "think" the sizes are four bytes while in fact they are eight bytes."

Since the submitter did not leave a valid email address to contact, and the bug (#7073588) does not appear to have been pushed out to our external bug website promptly, and since I believe this might be interesting for anyone using the new ZIP64 support in JDK7's java.util.zip package, I'm posting my evaluation here. The "spec" referred to below is the ZIP file format specification APPNOTE.TXT from PKWare.

The "compressed/uncompressed size" part of the loc spec states

      If bit 3 of the general purpose bit flag is set,
      these fields are set to zero in the local header and the
      correct values are put in the data descriptor and
      in the central directory.  If an archive is in ZIP64 format
      and the value in this field is 0xFFFFFFFF, the size will be
      in the corresponding 8 byte ZIP64 extended information
      extra field.

and the ZIP64 Information Extra Field (0x0001) spec says

      The following is the layout of the zip64 extended
      information "extra" block. If one of the size or
      offset fields in the Local or Central directory
      record is too small to hold the required data,
      a Zip64 extended information record is created.
      The order of the fields in the zip64 extended
      information record is fixed, but the fields will
      only appear if the corresponding Local or Central
      directory record field is set to 0xFFFF or 0xFFFFFFFF.
      This entry in the Local header must include BOTH original
      and compressed file size fields.

The above spec appears to say three things here

(1) If the loc size and csize are to be stored in the data descriptor (when general purpose flag bit 3 is set), these fields are set to zero in the loc and the correct values are put in the data descriptor and in the central directory.

(2) If this archive is in ZIP64 format (what does this really mean? one possible interpretation is that a ZIP64 extension appears in the "extra field" of this loc) AND these 2 fields are 0xFFFFFFFF, then the corresponding size/csize can be found in the ZIP64 extension in the extra field.

(3) In order to have size and csize appear in the ZIP64 extended info extra field, their corresponding fields in the loc MUST be 0xffffffff. This means that if the csize/size have to be present in the loc's ZIP64 extra field (>4G), the size/csize fields in this loc MUST be 0xffffffff.

Here is the problem: if bit 3 of the general purpose flag is set, and therefore the size and csize fields in the loc MUST be ZERO, then (3) can NOT be true.

And from an implementation point of view, the reason we have the "data descriptor" at all is mostly that you don't yet know the values of size and csize when writing the loc (such as in the streaming case). It really does not make sense to also have a ZIP64 extended info extra field, which is part of the loc, when you still don't know the size/csize values while writing it.

That said, this obviously contradicts what is specified in the extracting part of the "data descriptor" spec, as quoted:

      When compressing files, compressed and uncompressed sizes
      should be stored in ZIP64 format (as 8 byte values) when a
      files size exceeds 0xFFFFFFFF.   However ZIP64 format may be
      used regardless of the size of a file.  When extracting, if
      the zip64 extended information extra field is present for
      the file the compressed and uncompressed sizes will be 8
      byte values.

which says you CAN have a ZIP64 extended info extra field in a loc (size & csize must be 0xffffffff), even if bit 3 of the general purpose flag is set (size & csize must be 0).

I didn't have an answer for this, so I sent PKWare, the owner of the ZIP format specification, an email asking for clarification.

The clarification came back promptly as:

Thank you for your interest in the ZIP format.  I reviewed the APPNOTE and I believe the documentation should be updated for more clarity on this.  I will log a ticket to get further clarification on this record into a future version of the APPNOTE.  

To address your question, you would not use the Data Descriptor (presence is signaled using bit 3) at the same time as the ZIP64 Extended Information Extra Field (which uses the 0xFFFFFFFF value and "Extra Field" 0x0001).  When using the Data Descriptor, the values would be written as ZERO.  When alternatively, the ZIP64 extended information extra field is used, the values should be 0xFFFFFFFF.

I hope this helps with your understanding.  Please let me know if there is any additional information I can provide to you on this topic.  

It appears the suggestion is to NOT have both the Data Descriptor and the ZIP64 Extended Information Extra Field at the "same time". Our implementation is doing exactly that, so I concluded this one as "not a defect" for now.
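As a quick sanity check of that behavior, you can inspect the raw bytes ZipOutputStream produces. In this small sketch (the class name and entry name are mine, not from the JDK), a streamed entry should have general purpose flag bit 3 set and the loc size fields zero, exactly as discussed above:

```java
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class Bit3Check {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(baos)) {
            // sizes/crc are not set up front, so ZipOutputStream streams
            // the entry and writes a data descriptor after the data
            zos.putNextEntry(new ZipEntry("hello.txt"));
            zos.write("hello".getBytes("UTF-8"));
            zos.closeEntry();
        }
        byte[] zip = baos.toByteArray();
        // LOC header layout: signature(4) version(2) flags(2) ...
        // the low byte of the general purpose flag is at offset 6
        System.out.println("bit 3 set: " + ((zip[6] & 0x08) != 0));
        // the crc/csize/size fields of the LOC (offsets 14..25) are all zero
        boolean zeroSizes = true;
        for (int i = 14; i < 26; i++)
            zeroSizes &= (zip[i] == 0);
        System.out.println("loc sizes zero: " + zeroSizes);
    }
}
```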

Friday May 27, 2011

The DeflaterOutputStream is now "flushable"

DeflaterOutputStream is one of the classes that implement Flushable, so what do I mean by it is NOW flush-able?
Let's use an Echo sample to illustrate what I mean by "flush-able": it implements an Echo client that connects to an Echo server via a socket, and the Echo server simply receives data from its client and echoes it back.

Start the server at localhost port 4444, as

sherman@sherman-linux:/tmp$ java Echo -server 4444
then the client
   sherman@sherman-linux:/tmp$ java Echo -client localhost 4444
   ECHO: Welcome to ECHO!
   Me  : hi there
   ECHO: hi there
   Me  : how are you doing?
   ECHO: how are you doing?
   Me  : exit

So far so good, everything works just as expected. Now let's take look at the code piece that we are interested. SocketIO is a wrapper class which wraps a connected Socket object, both the client and the server use this class to wrap their connected socket and then read and write bytes in and out of the underlying socket streams. SocketIO obtains its plain InputStream and OutputStream from the Socket as

     this.os = s.getOutputStream();
 = s.getInputStream();

Now suppose we want to be a little green, I mean to save a little bandwidth, by compressing the bytes to be sent. There are two ways to do that. You can use the class directly to compress the bytes and then hand the compressed bytes to the output stream, using on the receiver side to decompress the bytes coming in from the Socket input stream. Or you can wrap the OutputStream and InputStream from the Socket with a pair of DeflaterOutputStream and InflaterInputStream as

    this.os = new DeflaterOutputStream(s.getOutputStream()); = new InflaterInputStream(s.getInputStream());

Let's try the latter. Save, compile, re-start the server and then the client...

sherman@sherman-linux:/tmp$ java Echo -client localhost 4444

Oops, the client appears to hang, the expected "Welcome to ECHO" does not come in. The stacktrace dump indicates the connection has been established but the client is still waiting for that "Welcome to ECHO" handshake, while the server appears to have already sent them out.

  "main" prio=10 tid=0x08551c00 nid=0x597 runnable [0x00563000]
     java.lang.Thread.State: RUNNABLE
         at Method)
         at Echo$
         at Echo.main(
What happened? Where is our "Welcome to the ECHO"? Buffered somewhere in the pipe? We do invoke flush() to force flushing the output stream every time after writing in SocketIO.write(), as

    void write(String ln) throws IOException {
        os.write((ln + "\n").getBytes());
        os.flush();
    }

It turned out the anticipated bytes were "stuck" in the internal buffer of zlib's deflater; the deflater is patiently waiting for more data to come, hoping for better compression. The spec of DeflaterOutputStream.flush() says it "Flushes this output stream and forces any buffered output bytes to be written out to the stream", but the implementation in previous JDK releases only flushes the underlying output stream; it does NOT flush the deflater. So if the data is stuck in the deflater, DeflaterOutputStream.flush() can NOT force it out. Too bad:-) Is there really no way to flush the deflater?
If you take a look at zlib's doc/code, zlib actually does have 4 different flush modes when deflating:
  • Z_NO_FLUSH The deflater may accumulate/pend/hold input data in its internal buffer and wait for more input for better/best compression (in which the compressed data output might not be a "complete" block, so the inflater that works on these output data can not "decompress" them)
  • Z_SYNC_FLUSH All pending output is flushed to the output buffer and the output is aligned on a byte boundary, so that the inflater can get all input data available so far.
  • Z_FULL_FLUSH All output is flushed as with Z_SYNC_FLUSH, and the compression state is reset so that decompression can restart from this point if previous compressed data has been damaged or if random access is desired
  • Z_FINISH Pending input is processed, pending output is flushed
Z_SYNC_FLUSH is exactly the one we need in our Echo case. Unfortunately, until now the Java zlib implementation,, provided Z_NO_FLUSH as its only supported option during deflation; until finish() is invoked (which sets the flush mode to Z_FINISH), you can NOT force the deflater to flush out its pending data. This has actually been a well-known problem, and its bug id 4206909 had been on Java's Top 25 bug/rfe list for years.
The good news is that this bug has finally been fixed in JDK7: Deflater.deflate() now accepts a flush mode (NO_FLUSH, SYNC_FLUSH or FULL_FLUSH), and DeflaterOutputStream has a new constructor with a boolean syncFlush parameter; when it is true, flush() flushes the deflater with Z_SYNC_FLUSH.
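A minimal, socket-free sketch of the new behavior (the class name is mine): with syncFlush set to true, data written and flushed can already be inflated by the receiver before the stream is finished:

```java
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class SyncFlushDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        // 'true' enables the new JDK7 syncFlush mode: flush() does a Z_SYNC_FLUSH
        DeflaterOutputStream dos = new DeflaterOutputStream(wire, true);
        dos.write("Welcome to ECHO!".getBytes("UTF-8"));
        dos.flush();   // without syncFlush these bytes would sit in the deflater

        // the "receiver" can decompress everything flushed so far,
        // even though the deflated stream has not been finished
        InflaterInputStream iis = new InflaterInputStream(
                new ByteArrayInputStream(wire.toByteArray()));
        byte[] buf = new byte[64];
        int n =;
        System.out.println(new String(buf, 0, n, "UTF-8"));
    }
}
```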

Now back to our Echo sample, the change is easy, simply do

    this.os = new DeflaterOutputStream(s.getOutputStream(), true); = new InflaterInputStream(s.getInputStream());

Save, compile, start the server and the client... WOW, everything is back to normal, exactly the same as what we had before:-) except a little greener. I'm sure you now know why I titled this blog entry "The DeflaterOutputStream is now flushable!"

Thursday May 26, 2011

A Facelift for the Java Regular Expression Unicode support

As documented in the "Unicode Support" section of the latest JDK6 Pattern API, Java regular expressions support the Unicode hex notation, Unicode properties such as general category, block, uppercase, lowercase..., simple word boundaries, case mapping, supplementary characters, etc., and have claimed to be "in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression, plus RL2.1 Canonical Equivalents" from the beginning (JDK 1.4), which we believed should satisfy most developers' requirements for Unicode support. However, early this year Tom Christiansen took some time from his busy Perl 5.14 schedule to perform a comprehensive reality check of Java's Unicode support, focusing on regular expressions. Most of Tom's reports and the follow-up discussions between Tom and me can be found in the OpenJDK i18n-dev mailing archives. To summarize:

  • Though arguably the existing Unicode escape sequence for UTF-16, the \uXXXX notation, is indeed a mechanism/notation to specify a Unicode code point in Java regular expressions, the "newly" added Perl \x{...} is definitely a more convenient construct and serves better to meet the RL1.1 Hex Notation requirement.

  • The predefined POSIX character classes, such as \p{Upper}, \p{Lower}, \p{Space}, \p{Alpha}, etc., are all ASCII-only versions in j.u.regex, which obviously does not meet the Unicode guideline RL1.2a Compatibility Properties. (btw, Perl already evolved/migrated from its early ASCII-only implementation to the Unicode version years ago)

  • Some of the most frequently used/referred-to properties (and those requested by guideline RL1.2 Properties), such as Script, Alphabetic, Uppercase, Lowercase and White_Space (via \p{javaUpperCase}, \p{javaLowerCase}, \p{javaWhitespace}), are either missing or have "slightly" different interpretations compared to the standard. One of the reasons for this situation is that Java did not have access (in the existing implementation) to those properties defined in PropList.txt, so, for example, Alphabetic is completely missing, and Character.isLowerCase only takes GC=LOWERCASE_LETTER code points as "lowercase" and does not count Other_Lowercase.

  • The behaviors of the predefined character classes \b, \B and \w, \W in Java regex are "broken"; in Tom's words, they break the historical connection between the word characters \w and the word boundaries \b, for "no sound reason". \w is defined as an ASCII-only version, "A word character: [a-zA-Z_0-9]", while the implementation of \b is a "semi-" Unicode-aware version. The result, as shown in the sample Tom gave, is that "élève" is NOT matched by the pattern \b\w+\b. So the RL1.4 Simple Word Boundaries requirement is obviously NOT met.

  • There are too many "errors" in the Perl-related wording of the j.u.regex.Pattern doc; it is way overdue for an update.

It's an embarrassing reality. So even though JDK7 was at a very late stage of the release cycle, we were convinced that it was worth taking some risk to give the regex Unicode support a facelift. Here is a summary of what we have done so far; we believe that with these changes the Unicode support in regex is now in much better shape.

The support of the new Perl style \x{...} construct

Using the Java Unicode escape notation \uXXXX was the only way to specify a Unicode code point in Java regex before JDK7. It's really not convenient or straightforward when you have to deal with a supplementary character, for which two consecutive Unicode escapes (a surrogate pair) have to be used, for example,

    int codePoint = ...;
    String hexPattern = codePoint <= 0xFFFF
                        ? String.format("\\u%04x", codePoint)
                        : String.format("\\u%04x\\u%04x",
                                        (int)Character.highSurrogate(codePoint),
                                        (int)Character.lowSurrogate(codePoint));

compared to simply doing

  String hexPattern = "\\x{" + Integer.toHexString(codePoint) + "}";
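For example (a hypothetical snippet, names mine), matching the supplementary character U+1D11E MUSICAL SYMBOL G CLEF with both notations:

```java
import java.util.regex.Pattern;

public class HexNotationDemo {
    public static void main(String[] args) {
        // a supplementary character, encoded as a surrogate pair in a String
        String gclef = new String(Character.toChars(0x1D11E));
        // the new JDK7 Perl-style construct
        System.out.println(Pattern.matches("\\x{1D11E}", gclef));
        // the old way: two consecutive \ uXXXX escapes forming the pair
        System.out.println(Pattern.matches("\uD834\uDD1E", gclef));
    }
}
```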

A new flag UNICODE_CHARACTER_CLASS is added to support Unicode version of POSIX character classes

With this flag set, the ASCII-only predefined character classes (\b \B \w \W \d \D \s \S) and POSIX character classes (\p{Alpha}, \p{Lower}, \p{Upper}...) are flipped to their Unicode versions*, as documented in the latest JDK7 j.u.regex.Pattern API doc. This flag also enables UNICODE_CASE (when UNICODE_CHARACTER_CLASS is specified, everything goes Unicode). While ideally I would have liked to simply evolve/upgrade Java regex from the aged ASCII-only behavior to Unicode (with maybe an ASCII-only compatibility mode as a fallback), as Perl did years ago, given Java's "compatibility" spirit (and performance concerns as well), that was unlikely to be accepted.

* This addresses the various \b\w+\b issues mentioned above.
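Tom's élève example now behaves as expected once the flag is set; a small sketch (class name mine):

```java
import java.util.regex.Pattern;

public class UnicodeClassDemo {
    public static void main(String[] args) {
        String word = "\u00e9l\u00e8ve";   // "élève"
        // ASCII-only \w cannot match the accented letters
        System.out.println(Pattern.compile("\\b\\w+\\b")
                                  .matcher(word).matches());
        // with UNICODE_CHARACTER_CLASS, \w (and \b) go Unicode
        System.out.println(Pattern.compile("\\b\\w+\\b",
                                  Pattern.UNICODE_CHARACTER_CLASS)
                                  .matcher(word).matches());
    }
}
```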

\p{IsBinaryProperty} is introduced to explicitly support the following Unicode binary properties:

  • Alphabetic

  • Ideographic

  • Letter

  • Lowercase

  • Uppercase

  • Titlecase

  • Punctuation

  • Control

  • White_Space

  • Digit

  • Hex_Digit

  • Noncharacter_Code_Point

  • Assigned

The spec and implementation of both j.l.C.isLowerCase() and j.l.C.isUpperCase() have also been updated to match the Unicode Standard definition of "lowercase" (LOWERCASE_LETTER + Other_Lowercase) and "uppercase" (UPPERCASE_LETTER + Other_Uppercase). And we have two new friends, j.l.C.isIdeographic() and j.l.C.isAlphabetic(). All of the above Unicode properties now fully follow the Unicode Standard.
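A quick illustration of the difference Other_Lowercase makes, using U+02B0 MODIFIER LETTER SMALL H (general category Lm, but listed as Other_Lowercase in PropList.txt); the class name is mine:

```java
import java.util.regex.Pattern;

public class LowercaseDemo {
    public static void main(String[] args) {
        String s = "\u02B0";   // MODIFIER LETTER SMALL H: GC=Lm, Other_Lowercase
        // isLowerCase now counts Other_Lowercase code points
        System.out.println(Character.isLowerCase('\u02B0'));
        System.out.println(Pattern.matches("\\p{IsLowercase}", s));
        // Alphabetic includes the Lm category
        System.out.println(Pattern.matches("\\p{IsAlphabetic}", s));
    }
}
```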

Unicode Scripts are now supported

Scripts are specified either with the prefix Is, as in \p{IsHiragana}, or by using the script keyword (or its short form sc), as in \p{script=Hiragana} or \p{sc=Hiragana}. The script names supported by Pattern are the valid script names accepted and defined by UnicodeScript.forName. (Yes, j.l.Character.UnicodeScript is also new; I might write a separate blog entry about it.)
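For example, matching a run of Hiragana with both spellings (a small sketch, class name mine):

```java
import java.util.regex.Pattern;

public class ScriptDemo {
    public static void main(String[] args) {
        String hiragana = "\u3072\u3089\u304c\u306a";   // "ひらがな"
        System.out.println(hiragana.matches("\\p{IsHiragana}+"));
        System.out.println(hiragana.matches("\\p{sc=Hiragana}+"));
        // the new script lookup is also available directly
        System.out.println(Character.UnicodeScript.of('\u3072'));
    }
}
```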

And yes, we have corrected those errors in the Perl related document, with Tom's help.

Sure, most of the enhancements mentioned here are probably "who cares" features for most developers:-) but if you have read this far, I guess you do care, so here is the latest JDK7 j.u.regex API doc. And... yes, \N{...} and \X are on my to-do list:-)

And a BIG THANKS to Tom Christiansen!!!

Wednesday May 25, 2011

The ZIP filesystem provider in JDK7

It started as a NIO2 demo project with two goals. First, to demonstrate how to use the NIO2 file system provider interface to develop and deploy a custom file system, in this case a ZIP file system. Second, to show how easy and fun it is to use the NIO2 file system APIs to access a ZIP/Jar file. I happened to have some code around that had been prepared for a possible future class-library update, so we wrapped it with the NIO2 file system provider SPI, packed it into zipfs.jar and dropped it into the <JDK>/demo/nio/zipfs directory, done! OK, it took weeks:-) to pull everything together, clean up all the corner cases here and there, and test; it ended up being 5K+ lines of code. You can find all the source code at <JDK7>/demo/nio/zipfs/ in your latest JDK7 directory. Next, we needed to come up with some sample code to achieve the second goal, to show how easy and fun it is to use the APIs. After writing a couple of samples, wow, we started to realize that it's truly easy now to access a ZIP file via the NIO2 file system APIs. For example, to extract a ZIP entry SRC_NAME out of a ZIP file FILE_NAME, you only need to do

        try (FileSystem fs = FileSystems.newFileSystem(Paths.get(FILE_NAME), null)) {
            Files.copy(fs.getPath(SRC_NAME), Paths.get(DST_NAME));
        }

or to do a "fancy nio2 walk" on the ZIP file,

        try (FileSystem fs = FileSystems.newFileSystem(Paths.get(FILE_NAME), null)) {
            Files.walkFileTree(fs.getPath(DIR_NAME), new SimpleFileVisitor<Path>() {
                private int indent = 0;
                private void perform(Path file, BasicFileAttributes attrs) {
                    if (attrs.isDirectory())
                        System.out.printf("%" + (indent == 0 ? "" : indent << 1) + "s[%s]%n",
                                          "", file.getFileName());
                    else
                        System.out.printf("%" + (indent == 0 ? "" : indent << 1) + "s%s%n",
                                          "", file.getFileName());
                }
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    perform(file, attrs);
                    return FileVisitResult.CONTINUE;
                }
                @Override
                public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
                    perform(dir, attrs);
                    indent += 2;
                    return FileVisitResult.CONTINUE;
                }
                @Override
                public FileVisitResult postVisitDirectory(Path dir, IOException ioe) {
                    indent -= 2;
                    return FileVisitResult.CONTINUE;
                }
            });
        }
or, you can even do something you have not been able to do via the jar tool or the java.util.zip classes, for example

        Files.copy(Paths.get(SRC), fs.getPath(DST), COPY_ATTRIBUTES);

which copies a file into a ZIP file, preserving the creationTime and lastAccessTime attributes. Below is sample output showing the attributes of an entry copied into a ZIP file with the above line.


    creationTime    : Fri Apr 29 22:41:46 PDT 2011
    lastAccessTime  : Wed May 25 20:58:36 PDT 2011
    lastModifiedTime: Fri Apr 29 22:41:46 PDT 2011
    isRegularFile   : true
    isDirectory     : false
    isSymbolicLink  : false
    isOther         : false
    fileKey         : null
    size            : 26680
    compressedSize  : 4941
    crc             : c5f6eb5a
    method          : 8

More interesting and cool NIO2/ZIP file system usages can be found in the demo's sample code.

Sure, in order to make this work, you need to add the <JDK>/demo/nio/zipfs/zipfs.jar into your classpath or manually drop it into the lib/ext directory, if it's not there already.

The more test cases and sample code we wrote, the more we became convinced that it might be a good idea to simply deploy this ZIP file system provider into the system extensions directory, so the provider can be used directly (without playing with -classpath to add zipfs.jar) to access a ZIP/Jar file via the NIO2 file system APIs, as an alternative to the existing java.util.zip API. So since JDK7/b123, zipfs.jar has been deployed into lib/ext. You can now use the ZIP file system "out of the box" and access a ZIP/Jar file just like a "normal" file system.

As of JDK7, the ZIP file system provider supports the legacy JAR URL syntax. That is, entries in the zip/JAR file system are identified by URIs of the form:

    jar:{uri}!/{entry}

In addition, a ZIP file system can be created using URIs of the form:

    jar:{uri}

The legacy JAR URL syntax will be a challenge to the platform once the latest URI RFE is adopted. We have decided to ignore this issue for now, as an alternative URI syntax would be confusing to developers and would be inconsistent with the rest of the platform.
For JDK 7, the zip/JAR file must also be located on the file system, so the URI scheme of "{uri}" above will be "file", e.g.:

    jar:file:/tmp/foo.jar!/com/Foo.class

When creating a new FileSystem, properties may be used to configure the file system. These properties are passed to the FileSystems.newFileSystem methods via a java.util.Map. The key is the property name (String), and the value is the property value. The following 2 properties are currently supported:

"create" : The value is of type java.lang.String with a default value of "false". If the value is "true" (case sensitive) then the zip or JAR file is created if it doesn't already exist.

"encoding" : The value is of type java.lang.String with a default value of "UTF-8". The value of the property indicates the encoding of the names of the entries in the Zip or JAR file.

You may want to give it a try if you happen to have some zips and jars around; let us know what works and what needs improvement.

Friday May 01, 2009

Non-UTF-8 encoding in ZIP file

The ZIP specification (historically) does not specify which character encoding is to be used for the embedded file names and comments; the original IBM PC character encoding set, commonly referred to as IBM Code Page 437, is supposed to be the only encoding supported. The Jar specification, meanwhile, explicitly specifies UTF-8 as the encoding for all file names and comments in Jar files. Our java.util.jar and java.util.zip implementations therefore strictly follow the Jar specification and use UTF-8 as the sole encoding when dealing with the file names and comments stored in Jar/ZIP files.

The consequence? A ZIP file created by a "traditional" ZIP tool is not accessible to java.util.jar/zip-based tools, and vice versa, if a file name contains characters that are not compatible between Cp437 (as an alternative, tools might simply use the default platform encoding) and UTF-8.

For most Europeans, you're "lucky":-) in that you only need to avoid a "handful" of characters, such as the umlauts (OK, I'm just kidding), but for Japanese and Chinese users, most of the characters are simply out of luck. This is why bug 4244499 was No.1 on the Top 25 Java Bugs list for so many years. The bug is no longer on the list:-) it has finally been "fixed" in OpenJDK 7, b57. I still keep a snapshot as a record/kudo for myself:-)

The solution (I would rather call it a "solution" than a "fix") in JDK7 b57 is to introduce a new set of ZipInputStream, ZipOutputStream and ZipFile constructors that take a specific "charset" parameter, as shown below.

ZipFile(File, Charset)
ZipInputStream(InputStream, Charset)
ZipOutputStream(OutputStream, Charset)

With these new constructors, applications can now access non-UTF-8 ZIP files via ZipInputStream or ZipFile objects created with the appropriate encoding, or create ZIP files encoded in a non-UTF-8 charset via the new ZipOutputStream(os, charset) constructor, if necessary.
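For example, a round trip through a non-UTF-8 charset (ISO-8859-1 here as a stand-in for whatever legacy encoding your ZIP tool used; the class and entry names are mine):

```java
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipCharsetDemo {
    public static void main(String[] args) throws Exception {
        Charset cs = Charset.forName("ISO-8859-1");
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        // entry names are encoded with the given charset, not UTF-8
        try (ZipOutputStream zos = new ZipOutputStream(baos, cs)) {
            zos.putNextEntry(new ZipEntry("caf\u00e9.txt"));  // "café.txt"
            zos.write("data".getBytes("UTF-8"));
            zos.closeEntry();
        }
        // reading back with the same charset recovers the name intact
        try (ZipInputStream zis = new ZipInputStream(
                new ByteArrayInputStream(baos.toByteArray()), cs)) {
            ZipEntry e = zis.getNextEntry();
            System.out.println("caf\u00e9.txt".equals(e.getName()));
        }
    }
}
```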

zip is a stripped-down version of the Jar tool with an "-encoding" option to support non-UTF-8 encodings for entry names and comments; it can serve as a demo for how to use the new APIs (I used it as a unit test). I'm still debating with myself whether it is a good idea to officially introduce "-encoding" into the Jar tool...

Here are some things you might want to keep in mind when using these new APIs and the new JDK7 bundles.

(1) The java.util.jar package is untouched, therefore there is no behavior change when accessing Jar and ZIP files via the java.util.jar package (a Jar is a Jar; it uses UTF-8).

(2) UTF-8 is still used to decode file names and comments if the general purpose flag bit 11 (EFS) is ON, even if a non-UTF-8 charset is specified in the constructor. (See the PKWare ZIP spec for more detailed info regarding EFS.)

(3) Jar and ZIP files created by JDK7 b57 and later now set the general purpose flag bit 11 if UTF-8 is used to encode the file names and comments.

(4) Since JDK7 b57 we have switched to the "standard" UTF-8 charset in the java.util.jar/zip implementation; earlier Java releases used a "modified" version of UTF-8. This is an incompatible change for sure, but I strongly believe it is worth doing.

Enjoy the APIs! Leave me a comment if you have any questions, issues or problems.

Friday Apr 17, 2009

ZIP64, The Format for > 4G Zipfile, Is Now Supported

We heard you! finally:-)

Support for ZIP64, the format for > 4G ZIP files, has finally been added in the latest OpenJDK7 build (b55). This RFE (request for enhancement) had been buried so deep in the 200+ Jar/ZIP bug/rfe pile that I was not even aware of (or did not remember) its existence until a recent call from a customer strongly asking for it. Given that everyone now has 200G+ of disk space (and yes, most of my kid's video clips are now around 1G after I got that new camcorder), it is relatively easy for Jar/ZIP users to run into this 4G ceiling these days. The RFE quickly climbed to the top of my to-do list and is now in b55. From now on, only the sky is the limit for your Jar/ZIP files:-)

So if you have > 4G of stuff to jar/zip (either the total size of the files zipped is > 4G or the individual files themselves are > 4G), try out OpenJDK7-b55 (or later), either via the java.util.jar/zip APIs or the Jar tool. Let me know if you find anything not working or broken; I hope it's perfect though:-)

Here is the brief background info regarding the 4G size problem in Jar and ZIP file.

(1) Various size and position/offset related fields in the original ZIP format are 4 bytes, so by nature ZIP has this 4G size limitation.

(2) The field for the total number of files zipped/stored in ZIP's central directory record is 2 bytes, so it has a 65536 limit (Java's ZIP file implementation has some hacky code to work around this issue though).

(3) The ZIP64 format extensions were introduced in version 4.5 of the spec to address the above size limitations.

(4) JDK7 now fully supports the ZIP64(tm) format extensions defined by PKWare's ZIP specification.

If you are interested in source code, here is my ZIP64 "to-do" list (a copy/paste note from the spec) and the code diffs.

Tuesday Mar 10, 2009

The Overhaul of Java UTF-8 Charset

The UTF-8 charset implementation (in all JDK/JRE releases from Sun) has recently been updated to reject non-shortest-form UTF-8 byte sequences, since the old implementation might be leveraged in security attacks. Since then I have been asked many times about what this "non-shortest-form" issue is and what its possible impact might be, so here are the answers.

The first question usually goes: "what is the non-shortest-form issue?" The detailed and official answer can be found in Unicode Corrigendum #1: UTF-8 Shortest Form. To put it in simple words, the problem is that Unicode characters can be represented in more than one way (form) in what many people think of as "the UTF-8 encoding". When asked what the UTF-8 encoding looks like, the easy/simple/brief explanation would be the bit pattern shown below

#  Bits  Bit pattern
1    7   0xxxxxxx
2   11   110xxxxx 10xxxxxx
3   16   1110xxxx 10xxxxxx 10xxxxxx
4   21   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

It's close, but it's actually WRONG, based on the latest definition/spec of UTF-8. The pattern above has a loophole: you can have more than one form representing the same Unicode character. For example, for the ASCII characters U+0000 to U+007F, the UTF-8 encoding form maintains transparency for all of them, so they keep their ASCII code values 0x00..0x7F (in one-byte form) in UTF-8. However, based on the above pattern, these characters could also be represented in two-byte form as [c0, 80]..[c1, bf], the "non-shortest form". The code below prints all of these non-shortest two-byte forms for the ASCII characters, if you run it against an "OLD" version of the JDK/JRE.

byte[] bb = new byte[2];
for (int b1 = 0xc0; b1 < 0xc2; b1++) {
    for (int b2 = 0x80; b2 < 0xc0; b2++) {
        bb[0] = (byte)b1;
        bb[1] = (byte)b2;
        String cstr = new String(bb, "UTF8");
        char c = cstr.toCharArray()[0];
        System.out.printf("[%02x, %02x] -> U+%04x [%s]%n",
                          b1, b2, c & 0xffff, (c >= 0x20) ? cstr : "ctrl");
    }
}

The output would be

[c0, a0] -> U+0020 [ ]
[c0, a1] -> U+0021 [!]
...
[c0, b6] -> U+0036 [6]
[c0, b7] -> U+0037 [7]
[c0, b8] -> U+0038 [8]
[c0, b9] -> U+0039 [9]
[c1, 80] -> U+0040 [@]
[c1, 81] -> U+0041 [A]
[c1, 82] -> U+0042 [B]
[c1, 83] -> U+0043 [C]
[c1, 84] -> U+0044 [D]
...

so for a string like "ABC" you would have two forms of UTF-8 sequences

"0x41 0x42 0x43" and "0xc1 0x81 0xc1 0x82 0xc1 0x83"

The Unicode Corrigendum #1: UTF-8 Shortest Form specifies explicitly that

"The definition of each UTF specifies the illegal code unit sequences in that UTF. For example, the definition of UTF-8 (D36) specifies that code unit sequences such as [C0, AF] are ILLEGAL."

Our old implementation accepted those non-shortest forms (though it never generated them when encoding). The new UTF-8 charset now rejects non-shortest-form byte sequences for all BMP characters; only the "legal byte sequences" listed below are accepted.

Legal UTF-8 Byte Sequences

#  Code Points          Bits  Byte 1  Byte 2  Byte 3  Byte 4
1  U+0000..U+007F         7   00..7F
2  U+0080..U+07FF        11   C2..DF  80..BF
3  U+0800..U+0FFF        16   E0      A0..BF  80..BF
   U+1000..U+FFFF             E1..EF  80..BF  80..BF
4  U+10000..U+3FFFF      21   F0      90..BF  80..BF  80..BF
   U+40000..U+FFFFF           F1..F3  80..BF  80..BF  80..BF
   U+100000..U+10FFFF         F4      80..8F  80..BF  80..BF
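The stricter decoder can be observed directly. With the default replace-on-malformed action, the two-byte non-shortest form of U+0000 no longer decodes to a NUL (a small sketch, class name mine):

```java
public class ShortestFormDemo {
    public static void main(String[] args) throws Exception {
        byte[] bad = { (byte)0xc0, (byte)0x80 };   // non-shortest form of U+0000
        // String(byte[], String) replaces malformed input with U+FFFD
        String s = new String(bad, "UTF-8");
        System.out.println(s.indexOf('\uFFFD') >= 0);  // bytes were rejected
        System.out.println(s.indexOf('\u0000') >= 0);  // no NUL came through
    }
}
```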

The next question would be "What would be the issue/problem if we keep using the old version of JDK/JRE?".

First, I'm not a lawyer... oops, I mean I'm not a security expert:-) so my word does not count, so we consulted our security experts. Their conclusion: "it is NOT a security vulnerability in Java SE per se, but it may be leveraged to attack systems running software that relies on the UTF-8 charset to reject these non-shortest forms of UTF-8 sequences".

A simple scenario that might give you an idea about what the above "may be leveraged to attack..." really means is

(1) A Java application wants to filter the incoming UTF-8 input stream to reject certain keywords, for example "ABC".

(2) Instead of decoding the input UTF-8 byte sequence into Java char representation and then filtering out the keyword string "ABC" at the Java "char" level, for example,

String utfStr = new String(bytes, "UTF-8");
if ("ABC".equals(utfStr)) { ... }

The application might choose to filter the raw UTF-8 byte sequences "0x41 0x42 0x43" (only) directly against the UTF-8 byte input stream and then rely on (assume) the Java UTF-8 charset to reject any other non-shortest-form of the target keyword, if there is any.

(3) The consequence is that the non-shortest-form input "0xc1 0x81 0xc1 0x82 0xc1 0x83" will penetrate the filter and may trigger a security vulnerability, if the underlying JDK/JRE runtime is an OLD version.
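For completeness, here is a hypothetical sketch (class and method names are made up for illustration) of the safer shape: decode first, then filter at the char level. On an up-to-date JDK the overlong bytes can no longer decode to "ABC" — with the default replacement behavior of new String they become U+FFFD — so the keyword cannot sneak through:

```java
import java.nio.charset.StandardCharsets;

public class KeywordFilter {
    // Decode the bytes first, then compare at the char level, instead of
    // scanning the raw byte stream for "0x41 0x42 0x43" only.
    public static boolean containsKeyword(byte[] input, String keyword) {
        // On a current JDK, malformed (non-shortest-form) byte sequences are
        // replaced with U+FFFD here, so they cannot smuggle "ABC" through.
        String decoded = new String(input, StandardCharsets.UTF_8);
        return decoded.contains(keyword);
    }

    public static void main(String[] args) {
        byte[] shortest = { 0x41, 0x42, 0x43 };            // "ABC"
        byte[] overlong = { (byte) 0xC1, (byte) 0x81,      // non-shortest form
                            (byte) 0xC1, (byte) 0x82,
                            (byte) 0xC1, (byte) 0x83 };
        System.out.println(containsKeyword(shortest, "ABC")); // true
        System.out.println(containsKeyword(overlong, "ABC")); // false
    }
}
```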

So the recommendation is to update to the latest JDK/JRE releases to avoid the potential risk.

Wait, there is also a big bonus to updating. The UTF-8 charset implementation had not been updated/touched for years; given that the UTF-8 encoding is so widely used (it is the default encoding for XML, and more and more websites use UTF-8 as their page encoding), we had been taking the "defensive" position of "don't change it if it works". So Martin and I decided to take this as an opportunity to give it a speed boost as well. The data below is taken from one of my benchmark runs (this is NOT an official benchmark; it is provided only to give a rough idea of the performance boost), comparing the decoding/encoding operations of the new and old implementations under the -server VM. The new implementation is much faster, especially when de/encoding single bytes (ASCII). The new decoding and encoding are faster under the -client VM as well, but the gap is not as big as under -server; I wanted to show you the best:-)

Method Millis Millis(OLD)
Decoding 1b UTF-8 : 1786 12689
Decoding 2b UTF-8 : 21061 30769
Decoding 3b UTF-8 : 23412 44256
Decoding 4b UTF-8 : 30732 35909
Decoding 1b (direct)UTF-8 : 16015 22352
Decoding 2b (direct)UTF-8 : 63813 82686
Decoding 3b (direct)UTF-8 : 89999 111579
Decoding 4b (direct)UTF-8 : 73126 60366
Encoding 1b UTF-8 : 2528 12713
Encoding 2b UTF-8 : 14372 33246
Encoding 3b UTF-8 : 25734 26000
Encoding 4b UTF-8 : 23293 31629
Encoding 1b (direct)UTF-8 : 18776 19883
Encoding 2b (direct)UTF-8 : 50309 59327
Encoding 3b (direct)UTF-8 : 77006 74286
Encoding 4b (direct)UTF-8 : 61626 66517

The new UTF-8 charset implementation has been integrated in
JDK7, JDK6-open, JDK6-u11 and later, JDK5.0u17 and 1.4.2_19.

And if you are interested in what the change looks like, you can take a peek at the webrev of the new implementation for OpenJDK7.

Sunday Mar 08, 2009

Named Capturing Group in JDK7 RegEx

In a complicated regular expression, which might have many capturing and non-capturing groups, the left-to-right group-number counting can get a little confusing, and the expression itself (a group and its back reference) becomes hard to understand and trace. A natural solution is, instead of counting groups manually from left to right, to give each group a name, and then back-reference it (in the same regex) or access the captured match result (from MatchResult) by the assigned NAME, as Python, PHP, .NET and Perl (5.10.0) do in their regex engines. This convenient feature has been missing from Java RegEx for years; now it has finally made it into JDK7 b50.

The newly added RegEx constructs to support the named capturing group are:

(1) (?<NAME>X) to define a named group "NAME"
(2) \k<NAME> to back-reference the named group "NAME"
(3) $<NAME> to reference the captured group in the matcher's replacement string
(4) group(String NAME) to return the captured input subsequence matched by the given named group

With these new constructs, now you can write something like
    String pStr = "0x(?<bytes>\\p{XDigit}{1,4})\\s++u\\+(?<char>\\p{XDigit}{4})(?:\\s++)?";
    Matcher m = Pattern.compile(pStr).matcher(INPUTTEXT);
    if (m.matches()) {
        int bs = Integer.valueOf(m.group("bytes"), 16);
        int c  = Integer.valueOf(m.group("char"), 16);
        System.out.printf("[%x] -> [%04x]%n", bs, c);
    }

    System.out.println("0x1234 u+5678".replaceFirst(pStr, "u+$<char> 0x$<bytes>"));

OK, the examples above just show how to use these new constructs; that does not necessarily mean they are "better":-) though they are easier to understand, at least for me.

The method group(String name) is NOT added to the MatchResult interface out of compatibility concern. (Personally I don't think it's a big deal; my guess is the majority of RegEx users can live with just the Matcher class, so compatibility weighs more here. Let me know otherwise.)
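As a small extra illustration of construct (2), here is a sketch (class and method names are mine) that uses \k<name> to find a doubled word, something that is noticeably harder to read with numbered back references:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedBackref {
    // \k<word> matches exactly what the named group "word" captured,
    // so the pattern below matches a doubled word such as "bye bye".
    public static String doubledWord(String input) {
        Matcher m = Pattern.compile("(?<word>\\w+)\\s+\\k<word>").matcher(input);
        return m.find() ? m.group("word") : null;
    }

    public static void main(String[] args) {
        System.out.println(doubledWord("bye bye now"));     // prints "bye"
        System.out.println(doubledWord("no repeats here")); // no doubled word: null
    }
}
```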

Thursday Feb 19, 2009

Superduper slow jar?

It's well known that jar is a "little" slow:-) How slow? On my "aged" Sun Blade 1000, it takes about 1 minute and 40 seconds to jar the whole rt.jar in "cf0M" mode (no compression, no manifest), and a little more in compress mode.

While we did feel it was a little bit "too slow", we figured that we were talking about jarring tens of thousands of classes with a total size of 50+MB; given the number of files and the total size, it might just need that much time. So it stayed this slow for years, and we never took the time to figure out whether it really needed it, until someone "accidentally" noticed that "the CPU went to 100% busy for quite some time, perhaps a minute or more on my laptop, before starting to hit the disk to create the jar archive".

That sounds strange: AFAIK the major job jar is supposed to do is to copy and compress the files (into the jar), so it should be hitting the disk from the very beginning to the end. So I took a first peek into the jar source code after all these years, and it turned out we had a very "embarrassing" bug: we were doing an O(n) look-up on a Hashtable (via its contains() method) for each and every file being jarred, where it really should be an O(1) look-up on a HashSet. Given the number of files involved, this "simple" mistake caused us to spend the majority of the time (of that 1 min 40+ sec) "collecting" the list of files to jar, instead of doing the real "jarring" work it is supposed to do, sigh:-(
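To illustrate the shape of the bug (this is a simplified sketch, not the actual jar sources): Hashtable.contains() searches the table's values linearly, while a HashSet offers the constant-time membership test that was intended:

```java
import java.util.HashSet;
import java.util.Hashtable;
import java.util.Set;

public class JarLookupSketch {
    // Old jar code (sketch): already-seen entry names kept in a Hashtable,
    // membership tested with contains(), which scans the VALUES linearly.
    public static int countUnique(String[] entries) {
        Hashtable<String, String> seen = new Hashtable<>();
        int unique = 0;
        for (String e : entries) {
            if (!seen.contains(e)) {   // O(n) value scan on every call
                seen.put(e, e);
                unique++;
            }
        }
        return unique;
    }

    // Fixed code (sketch): HashSet gives the intended O(1) hashed lookup.
    public static int countUniqueFixed(String[] entries) {
        Set<String> seen = new HashSet<>();
        int unique = 0;
        for (String e : entries) {
            if (seen.add(e)) {         // O(1); add() also reports membership
                unique++;
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        String[] entries = { "java/lang/Object.class", "java/lang/String.class",
                             "java/lang/Object.class" };
        // Same answer either way; very different cost at rt.jar scale.
        System.out.println(countUnique(entries) + " " + countUniqueFixed(entries));
    }
}
```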

With that fixed (in JDK7 build 44 and later), jar is much faster now. Below are quick "time measuring" numbers from 10 runs of jarring/zipping the rt.jar, in "store only" mode and "zip compression" mode.

b43: the JDK7/build43, which does not have the fix.
b47: the JDK7/build47, which does have the fix.
zip: the default zip installed on my Solaris, which is zip2.3/1999

(1)jar cf0M / zip -r0q (simply "store" no zip compression)
  1:43.7   20.6   10.2
  1:40.3   20.2    9.2
  1:40.1   21.0    9.0
  1:40.5   19.6   10.4
  1:40.9   19.6    8.7
  1:40.2   19.6    9.1
  1:40.0   18.6   10.0
  1:39.1   20.0    8.6
  1:41.3   18.5    9.0
  1:42.1   19.6    9.6

(2)jar cfM/zip -rq (with zip compression)

  1:47.0   25.3   15.7
  1:45.9   23.4   14.2
  1:44.7   23.3   14.9
  1:45.4   23.7   14.3
  1:45.6   23.3   14.3
  1:44.9   23.6   14.0
  1:45.9   23.2   14.6
  1:44.0   23.0   14.2
  1:44.9   23.3   14.8
  1:45.8   23.5   14.2

Jar has made big progress and is doing much, much better, though it is still slower than "zip". So we will continue our "catch-up" going forward. (I have to say it:-) I actually do have some code that brings jar much closer to zip, but it will take a while to make it into the product.)

  *1 The fix currently is only in JDK7.
  *2 If you are interested in the details of the fix, you can take a peek here



