Non-UTF-8 encoding in ZIP file

Historically, the ZIP specification did not say which character encoding to use for embedded file names and comments; the original IBM PC character set, commonly referred to as IBM Code Page 437, was supposed to be the only encoding supported. The Jar specification, meanwhile, explicitly specifies UTF-8 as the encoding for all file names and comments in Jar files. Our java.util.jar and java.util.zip implementation therefore strictly followed the Jar specification and used UTF-8 as the sole encoding when dealing with the file names and comments stored in Jar/Zip files.

The consequence? A ZIP file created by a "traditional" ZIP tool is not accessible to java.util.jar/zip based tools, and vice versa, if a file name contains characters that are not compatible between Cp437 (or the default platform encoding, which some tools simply use instead) and UTF-8.
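To make the mismatch concrete, here is a minimal sketch of what happens when name bytes written in a legacy single-byte encoding are decoded as UTF-8. It uses ISO-8859-1 as a stand-in for the legacy encoding (an assumption for portability; a traditional ZIP tool would more likely use Cp437 or the platform default), and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class NameMismatchDemo {
    // Encode the name with a legacy single-byte charset (ISO-8859-1 here,
    // standing in for Cp437), then decode those bytes as if they were UTF-8.
    static String decodeLegacyBytesAsUtf8(String name) {
        byte[] legacyBytes = name.getBytes(StandardCharsets.ISO_8859_1);
        // Bytes that are not valid UTF-8 are replaced with U+FFFD.
        return new String(legacyBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "f\u00FCr.txt"; // "fuer.txt" spelled with a u-umlaut
        String misread = decodeLegacyBytesAsUtf8(original);
        // The umlaut byte is not a valid UTF-8 sequence, so the name no longer matches.
        System.out.println(original.equals(misread));
    }
}
```

Plain ASCII names survive this round trip unchanged, which is why the problem only shows up once a name leaves the ASCII range.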

For most Europeans, you're "lucky" :-) in that you only need to avoid a "handful" of characters, such as the umlauts (OK, I'm just kidding), but for Japanese and Chinese, most of the characters are simply out of luck. This is why bug 4244499 was No.1 on the Top 25 Java Bugs for so many years. The bug is no longer on the list :-) It has finally been "fixed" in OpenJDK 7, b57. I still keep a snapshot as the record/kudos for myself :-)

The solution (I would rather call it a "solution" than a "fix") in JDK7 b57 is to introduce a new set of ZipInputStream, ZipOutputStream and ZipFile constructors that take a specific "charset" as a parameter, as shown below.

ZipFile(File, Charset)
ZipInputStream(InputStream, Charset)
ZipOutputStream(OutputStream, Charset)

With these new constructors, applications can now access non-UTF-8 ZIP files via ZipInputStream or ZipFile objects created with the specific encoding, or create ZIP files encoded in a non-UTF-8 charset via the new ZipOutputStream(os, charset) constructor, if necessary.
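As a rough illustration of the constructors listed above, the following sketch writes a one-entry ZIP in memory with a non-UTF-8 charset and reads the entry name back with the same charset. ISO-8859-1 is used only because it is guaranteed to be available on every Java runtime; substitute whatever charset the archive really uses. The class and helper names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class CharsetZipRoundTrip {
    // Write a one-entry ZIP whose entry name is encoded with the given charset.
    static byte[] writeZip(String entryName, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes, charset)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write("hello".getBytes(StandardCharsets.US_ASCII));
            zos.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Return the name of the first entry, decoded with the given charset.
    static String readFirstEntryName(byte[] zip, Charset charset) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip), charset)) {
            ZipEntry entry = zis.getNextEntry();
            return entry == null ? null : entry.getName();
        }
    }

    public static void main(String[] args) throws IOException {
        Charset legacy = StandardCharsets.ISO_8859_1; // stand-in for the archive's real charset
        byte[] zip = writeZip("f\u00FCr.txt", legacy);
        System.out.println(readFirstEntryName(zip, legacy));
    }
}
```

For an archive on disk, ZipFile(File, Charset) is used the same way: pass the charset the archive was written with and read the names from its entries.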

zip is a stripped-down version of the Jar tool with a "-encoding" option to support non-UTF-8 encodings for entry names and comments; it can serve as a demo of how to use the new APIs (I used it as a unit test). I'm still debating with myself whether it is a good idea to officially introduce "-encoding" into the Jar tool...

Some things you might want to keep in mind when using these new APIs and the new JDK7 bundles:


(1) The java.util.jar package is not touched, therefore there is no behavior change when accessing Jar and ZIP files via the java.util.jar package (Jar is Jar, it uses UTF-8).

(2) UTF-8 is still used to decode the file names and comments if the general purpose flag bit 11 (EFS) is ON, even if a non-UTF-8 charset is specified in the constructors. (See the PKWare ZIP spec for more detailed info regarding EFS.)

(3) Jar and ZIP files created by JDK7 b57 and later now set the "general purpose flag bit 11" if UTF-8 encoding is used to encode the file names and comments.

(4) Since JDK7 b57 we have switched to the "standard" UTF-8 charset in the java.util.jar/zip implementation; earlier Java releases use a "modified" version of UTF-8. This is an incompatible change for sure, but I strongly believe this is something worth doing.
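Points (2) and (3) can be checked with a small experiment. java.util.zip does not expose the general purpose flag, so this sketch peeks at the raw bytes of the first local file header, where the flag sits at byte offsets 6 and 7 in little-endian order. The class and helper names are hypothetical, and ISO-8859-1 again stands in for a non-UTF-8 charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class EfsFlagDemo {
    static final int EFS = 0x0800; // general purpose flag bit 11

    // Write a one-entry ZIP whose entry name is encoded with the given charset.
    static byte[] writeZip(String entryName, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes, charset)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Local file header layout: bytes 0-3 signature, 4-5 version needed,
    // 6-7 general purpose bit flag (little-endian).
    static boolean firstEntryHasEfs(byte[] zip) {
        int flags = (zip[6] & 0xFF) | ((zip[7] & 0xFF) << 8);
        return (flags & EFS) != 0;
    }

    // Return the name of the first entry, decoded via a ZipInputStream built with the given charset.
    static String readFirstEntryName(byte[] zip, Charset charset) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip), charset)) {
            ZipEntry entry = zis.getNextEntry();
            return entry == null ? null : entry.getName();
        }
    }

    public static void main(String[] args) throws IOException {
        String cjkName = "\u65E5\u672C\u8A9E.txt"; // a Japanese file name
        byte[] utf8Zip = writeZip(cjkName, StandardCharsets.UTF_8);
        byte[] latinZip = writeZip("f\u00FCr.txt", StandardCharsets.ISO_8859_1);

        // Point (3): the flag should be set only for the UTF-8 archive.
        System.out.println("UTF-8 zip has EFS: " + firstEntryHasEfs(utf8Zip));
        System.out.println("ISO-8859-1 zip has EFS: " + firstEntryHasEfs(latinZip));

        // Point (2): with EFS on, the name is decoded as UTF-8 even though
        // a different charset was passed to the constructor.
        String name = readFirstEntryName(utf8Zip, StandardCharsets.ISO_8859_1);
        System.out.println("name decoded correctly: " + cjkName.equals(name));
    }
}
```

On JDK 7 b57 or later, the UTF-8 archive should report the flag as set and the ISO-8859-1 archive should not, and the CJK name should come back intact despite the mismatched constructor charset.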

Enjoy the APIs! Leave me a comment if you have any questions, issues or problems.

Comments:

Sounds like you've done well to grasp the nettle: no perfect solution possible, but you seem to have a good balance.

I'm sure that a few people will now have JAR files they can't read with the new scheme, so I think that you should ensure that you have the support for old-style modified UTF-8 ready to roll immediately else you might break working production systems with months and months of waiting for a fix.

Rgds

Damon

Posted by Damon Hart-Davis on May 01, 2009 at 06:07 PM PDT #

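[Trackback] Have you ever been unable to extract, on UN*X, a zip that was created on Windows? A quick Google search suggests that Ubuntu was early to address this. I checked the support in Java and Solaris. The Java support is explained in the blog below. I have not yet been able to verify the support in the jar command, but you can verify the behavior by compiling and running the source from the blog below with JDK7. Non-UTF-8 encoding in ZIP file : Xueming Shen's Blog ...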

Posted by Let the Sunshine In on August 22, 2009 at 06:04 AM PDT #

Hi Xueming Shen
ZipInputStream zis = new ZipInputStream(new FileInputStream(zipFile),Charset.forName("UTF-8"));
I tried to use this new constructor but the problem is still there. Any idea on this?
Thank you very much

Posted by NghiaLuong on July 12, 2010 at 01:12 PM PDT #

In my understanding, using java.util.zip to compress or decompress a file with a CJK name is OK; however, if you use java.util.zip to compress a directory and then decompress it with other tools, it may cause issues.
To resolve the issue, we just use Java 7.0.
Am I right?
If so, how about applications developed with Java 6.0 or lower versions?
thanks

Posted by guest on August 21, 2011 at 09:20 PM PDT #
