Friday May 27, 2011

The DeflaterOutputStream is now "flushable"

DeflaterOutStream is one of the classes that implements Flushable, what do you mean it is NOW flush-able?
Let's use the sample, which implements an Echo client that connects to the Echo server via socket and the Echo server simply receives data from its client and echoes it back, to illustrate what I meant "flush-able".

Start the server at localhost port 4444, as

sherman@sherman-linux:/tmp$ java Echo -server 4444
then the client
   sherman@sherman-linux:/tmp$ java Echo -client localhost 4444
   ECHO: Welcome to ECHO!
   Me  : hi there
   ECHO: hi there
   Me  : how are you doing?
   ECHO: how are you doing?
   Me  : exit

So far so good, everything works just as expected. Now let's take look at the code piece that we are interested. SocketIO is a wrapper class which wraps a connected Socket object, both the client and the server use this class to wrap their connected socket and then read and write bytes in and out of the underlying socket streams. SocketIO obtains its plain InputStream and OutputStream from the Socket as

     this.os = s.getOutputStream(); = s.getInputStream();

If, we want to be a little green, I mean to save a little bandwidth by compressing the bytes to be sent. Two ways to do that, you can use class directly to compress the bytes and then hand the compressed bytes to the output stream, and on the receiver side to use to decompress the receiving bits from the Socket input stream. Or wrap the OutputStream and InputStream from the Socket  with a pair of and as

    this.os = new DeflaterOutputStream(s.getOutputStream()); = new InflaterInputStream(s.getInputStream());

Let's try the later. Save, compile, re-start the server and, the client...

sherman@sherman-linux:/tmp$ java Echo -client localhost 4444

Oops, the client appears to hang, the expected "Welcome to ECHO" does not come in. The stacktrace dump indicates the connection has been established but the client is still waiting for that "Welcome to ECHO" handshake, while the server appears to have already sent them out.

  "main" prio=10 tid=0x08551c00 nid=0x597 runnable [0x00563000]
     java.lang.Thread.State: RUNNABLE
         at Method)
         at Echo$
         at Echo.main(
What happened? Where is our "Welcome to the ECHO"? Buffered somewhere in the pipe? We do invoke flush() to force flushing the output stream everytime after writing in SocketIO.write() as
    void write(String ln) throws IOException {
It turned out the anticipated bytes are "stuck" in the internal buffer of zlib's deflater, the deflater is patiently waiting/expecting more data to come to have a possible better compression. The spec of the DeflaterOutputStream.flush() says it "Flushes this output stream and forces any buffered output bytes to be written out to the stream", the implementation of previous JDK releases is only to flush the "underlying output stream", it does NOT flush the deflater, so if the data is being stuck in the deflater, the DeflaterOutputStream.flush() can NOT force them out. Too bad:-) There is no way to flush deflater?
If you take a look at zlib's doc/code, zlib actually does have 4 different flush modes when deflating:
  • Z_NO_FLUSH The deflater may accumulate/pend/hold input data in its internal buffer and wait for more input for better/best compression (in which the compressed data output might not be a "complete" block, so the inflater that works on these output data can not "decompress" them)
  • Z_SYNC_FLUSH All pending output is flushed to the output buffer and the output is aligned on a byte boundary, so that the inflater can get all input data available so far.
  • Z_FULL_FLUSH All output is flushed as with Z_SYNC_FLUSH, and the compression state is reset so that decompression can restart from this point if previous compressed data has been damaged or if random access is desired
  • Z_FINISH Pending input is processed, pending output is flushed
The Z_SYNC_FLUSH is exactly the one we need in our Echo case. Unfortunately until now the Java zlib implementation,, provides Z_NO_FLUSH as its only supported option for deflation, until "finish()" is invoked, in which it sets the flush to be Z_FINISH, you can NOT force the deflater to flush out its pending data. This is actually a well known problem and its bugid 4206909 had been on Java's Top 25 bug/rfe list for years.
The good news is that this bug has been finally fixed in JDK7

Now back to our Echo sample, the change is easy, simply do

    this.os = new DeflaterOutputStream(s.getOutputStream(), true); = new InflaterInputStream(s.getInputStream());

Save, compile, start the server and the client... WOW, everything backs to  normal, exactly the same as what we had before:-) except a little green. I'm sure you now know why I titled the blog as "The DeflaterOutputstream is now flushable!"

Friday Apr 17, 2009

ZIP64, The Format for > 4G Zipfile, Is Now Supported

We heard you! finally:-)

The support for ZIP64, the format for > 4G ZIP file, has finally been added into the latest OpenJDK7 build(b55). This RFE (request for enhancements) had been buried in the 200+ Jar/ZIP bug/rfe pile so deep that I was not even aware of / remembered its existence until a recent call from A custom strongly asking for it. Given everyone now has 200G+ disk space (and yes, most of my kid's video clips are now around 1G after I got that new camcorder), it is relatively easy for the Jar/ZIP user to run into this 4G ceiling these days. The RFE was quickly climbing itself to the top of my to-do list and is now in b55. From now on only the sky is the limit for your Jar/ZIP file:-)

So if you have > 4G stuff (either the total size of files zipped in > 4G or the individual files themselves are > 4G) to jar/zip, try out the OpenJDK7-b55 (and later), either via the java.util.jar/zip  APIs or the Jar tool. Let me know if you find anything not working or broken, I hope it's perfect though:-)

Here is the brief background info regarding the 4G size problem in Jar and ZIP file.

(1)Various size and position offset related fields in original ZIP format are 4 bytes, so by nature ZIP has this 4G size limitation.

(2)The field for total number of files zipped/stored in ZIP's Central directory record is 2 bytes, so it has the 65536 limit (Java's ZIP file implementation has some hacky code to workaround this issue though)

(3)ZIP64 format extensions was introduced in (in spec 4.5) to address above size limitation.

(4)JDK7 now fully supports the ZIP64(tm) format extensions
defined by the PKWARE's  ZIP specification

If you are interested in source code, here is my ZIP64 "to-do" list (a copy/paste note from the spec) and the code diffs.

Tuesday Apr 07, 2009

Faster new String(bytes, cs/csn) and String.getBytes(cs/csn)

String(byte[] bytes, String csn) and String.getBytes(String csn) pair (and their variants) have been widely used as the convenient methods for char[]/String<->byte[] encoding/decoding when

a) You don't want to get your hands "dirty" on the trivial decoding and/or encoding process (via java.nio.charset)  and
b) The default error handling action when there is/are un-mappable/malformed byte(s)/char(s) in the input array  is acceptable for your situation (though spec of some methods states the "behavior is unspecified", Sun JDK/JRE runtime always performs REPLACE in this case)

These methods are convenient, especially when dealing with "small" byte[] and char[].  However we have been hearing complains for years (including from our EE folks) that these methods are slow, especially when there is a SecurityManager installed, and also consume too much memory than "expected".

The complains are somewhat accurate. A simple new String(byte[] bb, String csn) takes a "long" journey down to the CharsetDecoder to do the real conversion as listed below

         ->at java.lang.String.<init>(
          ->at java.lang.String.<init>(
            ->at java.lang.StringCoding.decode(
              ->at java.lang.StringCoding$StringDecoder.decode(
                ->at java.nio.charset.CharsetDecoder.decode(
                  ->at sun.nio.cs.SingleByte$Decoder.decodeLoop(

during the process, ByteBuffer and CharBuffer objects are created to wrap the source byte[] and the destinating char[],  decoder takes the byte[] and char[] in and out of the wrappers during decoding,  a long list of sanity checks, defensive copying operations have to be performed... regardless lots of performance tuning have been done already, they are truely slow and bloat, a sad reality.

Good news is we finally found ourself some time to look into the issue recently. With lots of trivial but effective tuning works here and there including a "fast pass/green lane" for "built-in" charsets (the charsets included in JDK/JRE's charset respository) we are able to  cut the "bloat" memory use during the de/coding and boost the coding speed quite lot for "small" size of byte[]/char[]/String objects. So far the results look pretty good (listed below). The change has been integrated into JDK7 b53.

(1)NetBeans Memory profiles of running StrCodingBenchMark against JDK7 b52 and b53 show the "bloat" 36%+ java.nio.HeapByte/CharBuffer  objects are completely gone.

b52 Result:

b53 result:

2)Speed: Out micro performance benchmark StrCodingBenchmark shows huge speed boost for small byte[]/String use scenario. Below is the result of using ISO-8859-1 charset for different de/encoding operations with different sizes of input and under different runtime configs.

[-server] server vm   [-client]  client vm   [+sm] with a SecurityManager installed
[String decode: csn] new String(byte[], "iso-8859-1")
[String decode: cs ] new String(byte[], Charset.forName("iso-8859-1"))
[String encode: csn] String.getBytes("iso-8859-1")
[String encode: cs ] String.getBytes(Charset.forName("iso-8859-1"))
\*The first 4 columns are JDKb53, the later 4 are results of b52, smaller the number is the faster

-server -client -server+sm -client+sm ------------b52/old-----------------
String decode: csn 29 52 27 51 61 140 117 215
String decode: cs 38 73 93 147 90 160 139 237
String encode: csn 23 43 24 43 64 116 135 188
String encode: cs 42 77 98 134 170 341 232 418
String decode: csn 31 60 32 59 65 148 118 225
String decode: cs 40 81 97 155 95 170 143 247
String encode: csn 27 51 27 49 65 125 136 197
String encode: cs 46 84 101 139 174 356 238 451
String decode: csn 36 72 35 71 68 164 122 241
String decode: cs 45 94 101 169 108 189 147 263
String encode: csn 31 63 31 63 68 142 141 214
String encode: cs 50 96 103 152 179 379 242 453
String decode: csn 47 101 46 101 76 197 131 274
String decode: cs 55 119 113 194 115 228 157 303
String encode: csn 40 89 40 89 78 173 149 248
String encode: cs 58 121 112 178 194 423 258 497
String decode: csn 68 152 66 159 97 262 149 339
String decode: cs 74 171 133 246 139 302 179 379
String encode: csn 56 140 56 139 94 243 165 316
String encode: cs 73 171 128 227 223 521 285 582
String decode: csn 108 256 104 254 130 389 189 465
String decode: cs 110 272 173 345 184 449 222 522
String encode: csn 91 238 88 237 126 371 195 458
String encode: cs 106 267 162 324 283 672 342 745
String decode: csn 187 462 179 459 197 645 242 721
String decode: cs 183 474 249 548 277 739 305 811
String encode: csn 171 437 150 435 190 642 257 712
String encode: cs 171 462 224 519 395 1021 447 1093
String decode: csn 349 877 332 874 335 1156 368 1229
String decode: cs 334 879 405 961 463 1349 476 1408
String encode: csn 284 830 277 828 317 1168 379 1240
String encode: cs 298 865 350 910 636 1698 673 1769
String decode: csn 782 1708 742 1701 660 2176 665 2262
String decode: cs 741 1689 834 1785 865 2507 846 2568
String encode: csn 571 1631 555 1616 577 2228 631 2299
String encode: cs 559 1635 601 1691 1121 3051 1128 3118

Some more results are at benchmark.txt.

Two facts worth mentioning are
(1)The best gain achived when de/encoding "small" sizes of byte[]/char[]/String (when de/encoding on "big" chunk of data, the overhead we are trying to cut off becomes marginal)
(2)Using Charset as the parameter does not bring you any performance benefit (in fact it's slower and bloat) in most use scenarios, use it with caution.

For now all the SingleByte charsets (except JIS0201) in our repository are benefited from this change. I Hope I can find some time to work on the MultiByte charsets later, especially the UTF-8, yes, it's on my to-do list.

Tuesday Mar 24, 2009

Case-Insensitive Matching in Java RegEx

-Question1: Does Java RegEx support case-insensitive matching?
-Yes, it does.

-Question2: Can Java RegEx do "case-insensitive matching" on Non-ASCII text?
-Sure, Java RegEx supports not only the case-sensitive matching of characters in US-ASCII, Unicode case-folding is supported as well from day-one.

-Question3: Really? How to do that?
-It's actually fairly easy. They can be easily enabled by specifying the corresponding match flag(s), CASE_INSENSITIVE and UNICODE_CASE. For example

for ASCII only case-insensitive matching

    Pattern p = Pattern.compile("YOUR_REGEX", Pattern.CASE_INSENSITIVE);

or for Unicode case-folding matching

    Pattern p = Pattern.compile("YOUR_REGEX", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

For those String.matches(String regex) fans, you can use the Special constructs (?i) and (?iu) to "embed" the matching flag(s) in the regex body. for example


for ASCII case-insensitive matching, use (?iu) for Unicode case-folding.

And here is the bonus, you can also "turn off" the "flag(s)" by using (?-i).

-Question4: Cool, here is the tough one, does it realy work that way? I heard...
-.......OK, if you are using JDK6u2 (and later updates) or JDK7 the answer is "YES". But if you still use 6.0 or earlier versions, yes, we screwed it up "a little" in those releases:-( and we fixed it in JDK7 and have back-port the change into 6.0u2 already. So get the latest update now.

For those who want to know a little more details, here is the story.

The case folding spec in Java Regex clearly says

CASE_INSENSITIVE: By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag IN CONJUNCTION with this flag.

UNICODE_CASE: When this flag is specified then case-insensitive matching, when enabled by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched.

But our RegEx implementation in previous releases disagrees with our own spec:-(

(1)The flag UNICODE_CASE is mostly treated as "UNICODE_CASE_INSENSITIVE", means the matching is always case insensitive no matter the flag CASE_INSENSITIVE is enabled or not. The implementation only "accidently" follows the spec (match on case_sensitive) in character class case if the specified character is a basic Latin (ASCII) or a Latin-1 supplement(<=0xff).

(2)When CASE_INSENSITIVE is NOT together with a UNICODE_CASE, some Unicode case insensitive matching is still being done.

  a)Character Class Single with Latin-1 Supplement input
    regex "[\\u00e0]" matches input "\\u00c0"

  b)Character Class Range with any Non-ASCII input
    regex "[\\u00e0-\\u00e5]" matches "\\u00c2"
    regex "[\\u0430-\\u0431]" matches "\\u0411"

  c)Back Reference with any Non-ASCII input
    regex "(\\u00e0)\\\\1" matches "\\u00e0\\u00c0"
    regex "(\\u0430)\\\\1" matches "\\u0430\\u0410"

(3)The only place we "get it right" is the regex constructs for a single character (for example "\\u00e0" does not match "\\u00c0") or a slice of characters (for example "\\u00e0\\u00e1"
does not match "\\u00c0\\u00c1") do follow the spec to only allow ASCII characters to case insensitive match when only CASE_INSENSITIVE presented.

Above implementation inconsistency has been fixed in JDK7 and JDK6u2 to fully follow the specification.

\*\\u0000-\\u007f: Basic Latin (aka ASCII)
\*\\u0080-\\u00ff: Latin-1 Supplement
\*\\u0400-\\u04f9: Cyrillic

Tuesday Mar 10, 2009

The Overhaul of Java UTF-8 Charset

The UTF-8 charset implementation (in all JDK/JRE releases from Sun) has been updated recently to reject non-shortest-form UTF-8 byte sequences, since the old implementation might be leveraged in security attacks. Since then I have been asked many times about what this "non-shortest-form" issue is and what the possible impact might be, so here are the answers.

The first question usually goes "what is the non-shortest-form issue"? The detailed and official answer can be found at Unicode Corrigendum #1: UTF-8 Shortest Form. Put it in simple words, the problem is that the Unicode characters could be represented in more than one way (form) in the "UTF-8 encoding" that many people think/believe it is. When be asked what the UTF-8 encoding looks like, the easy/simple/brief explain would be the bit pattern showed below

# Bits Bit pattern
1 7 0xxxxxxx
2 11 110xxxxx 10xxxxxx
3 16 1110xxxx 10xxxxxx 10xxxxxx
4 21 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

It's close, but it's actually WRONG, based on the latest definition/spec of UTF-8. The pattern above has a loophole that you can actually have more than one form to represent a Unicode character. For example, for ASCII characters from u+0000 to u+007f, the UTF-8 encoding form maintains transparency for all of them, so they keep their ASCII code values of 0x00..0x7f (in one-byte form) in UTF-8. However if based on the above pattern, These characters can also be represented in 2-bytes form as [c0, 80]..[c1, bf], the "non-shortest-form". The code below shows all of these non-shortest-2-bytes-form for these ASCII characters, if you run code against the "OLD" version of JDK/JRE.

byte[] bb = new byte[2];
for (int b1 = 0xc0; b1 < 0xc2; b1++) {
for (int b2 = 0x80; b2 < 0xc0; b2++) {
bb[0] = (byte)b1;
bb[1] = (byte)b2;
String cstr = new String(bb, "UTF8");
char c = cstr.toCharArray()[0];
System.out.printf("[%02x, %02x] -> U+%04x [%s]%n",
b1, b2, c & 0xffff, (c>=0x20)?cstr:"ctrl");

The output would be

[c0, a0] -> U+0020 [ ]
[c0, a1] -> U+0021 [!]
[c0, b6] -> U+0036 [6]
[c0, b7] -> U+0037 [7]
[c0, b8] -> U+0038 [8]
[c0, b9] -> U+0039 [9]
[c1, 80] -> U+0040 [@]
[c1, 81] -> U+0041 [A]
[c1, 82] -> U+0042 [B]
[c1, 83] -> U+0043 [C]
[c1, 84] -> U+0044 [D]

so for a string like "ABC" you would have two forms of UTF-8 sequences

"0x41 0x42 0x43" and "0xc1 0x81 0xc1 0x82 0xc1 0x83"

The Unicode Corrigendum #1: UTF-8 Shortest Form specifies explicitly that

"The definition of each UTF specifies the illegal code unit sequences in that UTF. For example, the definition of UTF-8 (D36) specifies that code unit sequences such as [C0, AF] are ILLEGAL."

Our old implementation accepts those non-shortest-form (while never generates them when encoding). The new UTF_8 charset now rejects the non-shortest-form byte sequences for all BMP characters, only the "legal byte sequences" listed below are accepted.

/\* Legal UTF-8 Byte Sequences
\* # Code Points Bits Bit/Byte pattern
\* 1 7 0xxxxxxx
\* U+0000..U+007F 00..7F
\* 2 11 110xxxxx 10xxxxxx
\* U+0080..U+07FF C2..DF 80..BF
\* 3 16 1110xxxx 10xxxxxx 10xxxxxx
\* U+0800..U+0FFF E0 A0..BF 80..BF
\* U+1000..U+FFFF E1..EF 80..BF 80..BF
\* 4 21 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
\* U+10000..U+3FFFF F0 90..BF 80..BF 80..BF
\* U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
\* U+100000..U10FFFF F4 80..8F 80..BF 80..BF

The next question would be "What would be the issue/problem if we keep using the old version of JDK/JRE?".

First, I'm not a lawyer...oops, I meant I'm not a security expert:-) so my word does not count. So we consulted with our security experts' opinion. Their conclusion is "it is NOT a security vulnerability in Java SE per se, but it may be leveraged to attack systems running software that relies on the UTF-8 charset to reject these non-shortest form of UTF-8 sequences".

A simple scenario that might give you an idea about what the above "may be leveraged to attack..." really means is

(1)A Java application would like to filter the incoming UTF-8 input stream to reject certain key words, for example "ABC"

(2)Instead of decoding the input UTF-8 byte sequences into Java char representation and then filter out the keyword string "ABC" at Java "char" level, for example,

String utfStr = new String(bytes, "UTF-8");
if ("ABC".equals(strUTF)) { ... }

The application might choose to filter the raw UTF-8 byte sequences "0x41 0x42 0x43" (only) directly against the UTF-8 byte input stream and then rely on (assume) the Java UTF-8 charset to reject any other non-shortest-form of the target keyword, if there is any.

(3)The consequence is the non-shortest form input "0xc1 0x81 0xc1 0x82 0xc1 0x83" will penetrate the filter and trigger a possible security vulnerability, if the underlying JDK/JRE runtime is an OLD version.

So the recommendation is to update to the latest JDK/JRE releases to avoid the potential risk.

Wait, there is also a big bonus for the updating. The UTF-8 charset implementation has not been updated/touched for years, given the fact that the UTF-8 encoding is so widely used (the default encoding for the XML and more and more websites use UTF-8 as their page encoding), we have been taking the "defensive" position of "don't change it if it works" the past years. So Martin and I decided to take this as an opportunity to give it a speed boost as well. The data below is taken from one of my benchmark (this is NOT an official benchmark, provided to give a rough idea of the performance boost) run data which compares the decoding/encoding operations of new implementation and old implementation under -server vm. The new implementation is much faster, especially when de/encoding the single bytes (those ASCIIs). The new decoding and encoding are faster under -client vm as well, but the gap is not as big as in -server vm,I wanted to show you the best:-)

Method Millis Millis(OLD)
Decoding 1b UTF-8 : 1786 12689
Decoding 2b UTF-8 : 21061 30769
Decoding 3b UTF-8 : 23412 44256
Decoding 4b UTF-8 : 30732 35909
Decoding 1b (direct)UTF-8 : 16015 22352
Decoding 2b (direct)UTF-8 : 63813 82686
Decoding 3b (direct)UTF-8 : 89999 111579
Decoding 4b (direct)UTF-8 : 73126 60366
Encoding 1b UTF-8 : 2528 12713
Encoding 2b UTF-8 : 14372 33246
Encoding 3b UTF-8 : 25734 26000
Encoding 4b UTF-8 : 23293 31629
Encoding 1b (direct)UTF-8 : 18776 19883
Encoding 2b (direct)UTF-8 : 50309 59327
Encoding 3b (direct)UTF-8 : 77006 74286
Encoding 4b (direct)UTF-8 : 61626 66517

The new UTF-8 charset implementation has been integrated in
JDK7, JDK6-open, JDK6-u11 and later, JDK5.0u17 and 1.4.2_19.

And if you are interested in what the change looks like, you can take a peek on the webrev of new for OpenJDK7.

Sunday Mar 08, 2009

Named Capturing Group in JDK7 RegEx

In a complicated regular expression, which might have many capturing and
non-capturing groups, the left to right group number counting might get a little confusing, and the expression itself (the group and its back
reference) becomes hard to understand and trace. A natural solution is
to, instead of counting it manually one by one and left to right, give each group a name, and then back reference it (in the same regex) or access the capturing match result (from MatchResult) by the assigned NAME, as what Python, PHP, .Net and Perl (5.10.0) do in their regex engines. This convenient feature has been missed in Java RegEx for years, now it finally got itself in JDK7 b50.

The newly added RegEx constructs to support the named capturing group are:

(1) (?<NAME>X) to define a named group NAME"                     
(2) \\k<Name> to backref a named group "NAME"                  
(3) <$<NAME> to reference to captured group in matcher's replacement str
(4) group(String NAME) to return the captured input subsequence by the given "named group"

With these new constructs, now you can write something like
    String pStr = "0x(?<bytes>\\\\p{XDigit}{1,4})\\\\s++u\\\\+(?<char>\\\\p{XDigit}{4})(?:\\\\s++)?";
    Matcher m = Pattern.compile(pStr).matcher(INPUTTEXT);
    if (m.matches()) {
        int bs = Integer.valueOf("bytes"), 16);
        int c =  Integer.valueOf("char"), 16);
        System.out.printf("[%x] -> [%04x]%n", bs, c);


    System.out.println("0x1234 u+5678".replaceFirst(pStr, "u+$<char> 0x$<bytes>"));

OK, examples above just show how to use these new constructs, not necessary mean they are "better":-) more "easy" to understand for me though.

The method group(String name) is NOT added into MatchResult interface for the compatibility concern ( personally I don't think it's a big deal, my guess is the majority of RegEx users can just live with the Matcher class, the compatibility weighs more here. Let me know otherwise).




« April 2014