Wednesday Aug 10, 2011

Is ZipInput/OutputStream handling the data descriptor wrong for big entries?


There is a bug report came in recently suggesting that the new ZIP64 support in JDK7 might not be implemented correctly. The submitter quoted the "data descriptor" section of  the  ZIP spec

      When compressing files, compressed and uncompressed sizes
      should be stored in ZIP64 format (as 8 byte values) when a
      files size exceeds 0xFFFFFFFF.   However ZIP64 format may be
      used regardless of the size of a file.  When extracting, if
      the zip64 extended information extra field is present for
      the file the compressed and uncompressed sizes will be 8
      byte values.

and suggested "This means the sizes are eight byte if there is a ZIP64 extended information extra field and four bytes if there is none.  This is not what java.util.zip implements, ZipOutputStream#writeEXT writes eight byte data if one of the sizes exceeds 0xFFFFFFFF but it never writes any ZIP64 extended information extra field.  This means conforming implementations will "think" the sizes are four bytes while in fact they are eight bytes."


Since the submitter did not leave us a valid email address to contact and the bug (#7073588) does not appear to be pushed out to our external bug website promptly, while I also believe this might be interested for anyone who might be using the new ZIP64 support in JDK7's java.util.zip  package. I posted my evaluation here. The "spec" referred below is the ZIP file format specification APPNOTE.TXT from PKWare.



The "compressed/uncompressed size" part of the loc spec states

      If bit 3 of the general purpose bit flag is set,
      these fields are set to zero in the local header and the
      correct values are put in the data descriptor and
      in the central directory.  If an archive is in ZIP64 format
      and the value in this field is 0xFFFFFFFF, the size will be
      in the corresponding 8 byte ZIP64 extended information
      extra field.

and the ZIP64 Information Extra Field (0x0001) spec says

      The following is the layout of the zip64 extended
      information "extra" block. If one of the size or
      offset fields in the Local or Central directory
      record is too small to hold the required data,
      a Zip64 extended information record is created.
      The order of the fields in the zip64 extended
      information record is fixed, but the fields will
      only appear if the corresponding Local or Central
      directory record field is set to 0xFFFF or 0xFFFFFFFF.
      ...
      This entry in the Local header must include BOTH original
      and compressed file size fields.

The above spec appears to say three things here

(1) If the loc size and csize  are to be stored in data descriptor (when the general purpose flag bit 3 is set), these fields are set to
ZERO.

(2) If this archive is in ZIP64 format (what does this really mean? one possible interpretation is that there is ZIP64 extention appears at the "extra field" of this loc) AND these 2 fields are 0xFFFFFFFF, then the corresponding size/csize can be found at the ZIP64 extention in the extra field.

(3) In order to have size and csize appears in ZIP64 extended info extra field, their corresponding fields in loc MUST be 0xffffffff. Which means if the csize/size have to be present in loc's ZIP64 extra field (>4G), the size/csize fields in this loc MUST be 0xffffffff.

Here is the problem, if the bit 3 of the general purpose flag is set, therefor the size and csize fields in loc MUST be ZERO, then  (3) can NOT be true.

And from implementation point view, the reason why we have the "data description" is mostly because you don't know the value of size and csize yet when writing the loc (such as in the streaming case), it really does not make sense to have a zip64 extended info extra field as well, which is part of the loc, and you still don't know the size/ csize values when writing it.

That said, this is obviously contradicting to what is specified in the extracting part of the "data descriptor" spec, as quoted,

      When compressing files, compressed and uncompressed sizes
      should be stored in ZIP64 format (as 8 byte values) when a
      files size exceeds 0xFFFFFFFF.   However ZIP64 format may be
      used regardless of the size of a file.  When extracting, if
      the zip64 extended information extra field is present for
      the file the compressed and uncompressed sizes will be 8
      byte values.

which says you CAN have a ZIP64 extended info extra field in a loc (sizeZ&csize must be 0xffffffff), even if the bit 3 of the general flag is set (size&csize must be 0).


I don't have an answer for this, so I sent PKWare, the owner of the ZIP format specification, an email for clarification.

The clarification came back promptly as:

--------------------------------------------------------
Thank you for your interest in the ZIP format.  I reviewed the APPNOTE and I believe the documentation should be updated for more clarity on this.  I will log a ticket to get further clarification on this record into a future version of the APPNOTE.  

To address your question, you would not use the Data Descriptor (presence is signaled using bit 3) at the same time as the ZIP64 Extended Information Extra Field (which uses the 0xFFFFFFFF value and "Extra Field" 0x0001).  When using the Data Descriptor, the values would be written as ZERO.  When alternatively, the ZIP64 extended information extra field is used, the values should be 0xFFFFFFFF.

I hope this helps with your understanding.  Please let me know if there is any additional information I can provide to you on this topic.  
---------------------------------------------------------

It appears the suggestion is to NOT have both Data Descriptor and ZIP64 extended Information Extra Field at the "same time". Our implementation is doing exactly that, so I concluded this one as "not a defect" for now.


Wednesday May 25, 2011

The ZIP filesystem provider in JDK7

It started as a NIO2 demo project with two goals. First, to demonstrate how to use NIO2 file system provider interface to develop and deploy a custom file system, a ZIP file system. Second, to show how easy and fun to use the NIO2 file system APIs, to access a ZIP/Jar file. I happened to have some code around which was prepared for a possible future j.u.zip.ZipFile class update, so we wrapped it with the NIO2 file system provider SPI, packed it into zipfs.jar and dropped it into the <JDK>/demo/nio/zipfs directory, done! OK, it took weeks:-) to pull everything together, clean up all the corner cases here and there, testing, and it ended up with 5K+ lines of code. You can find all the source code at <JDK7>/demo/nio/zipfs/src.zip of your latest JDK7 directory, or "here" if you prefer a quick browsing. Next,  need come up with some sample code to achieve the second goal, how easy and fun to use the APIs. After writing down couple samples, wow,  we started to realize that it's truly easy now to access the ZIP file via the nio2 file system APIs, for example, to extract a ZIP entry SRC_NAME out of a ZIP file FILE_NAME, you only need to do

        try (FileSystem fs = FileSystems.newFileSystem(Paths.get(FILE_NAME), null)) {
           Files.copy(fs.getPath(SRC_NAME), Paths.get(DST_NAME);
       }


or to do a "fancy nio2 walk" on the ZIP file,

        try (FileSystem fs = FileSystems.newFileSystem(Paths.get(FILE_NAME), null)) {
            Files.walkFileTree(fs.getPath(DIR_NAME),  new SimpleFileVisitor<Path>() {
                    private int indent = 0;
                    private void perform(Path file, BasicFileAttributes attrs) {
                        if (attrs.isDirectory())
                            System.out.printf("%" + (indent==0?"":indent<<1) + "s[%s]%n",
                                              "",
                                              file.toString());
                        else
                            System.out.printf("%" + (indent==0?"":indent<<1) + "s%s%n",
                                              "",
                                              file.getFileName().toString());
                    }
                    @Override
                    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                        perform(file, attrs);
                        return FileVisitResult.CONTINUE;
                    }
                    @Override
                    public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)  {
                        perform(dir, attrs);
                        indent += 2;
                        return FileVisitResult.CONTINUE;
                    }
                    @Override
                    public FileVisitResult postVisitDirectory(Path dir, IOException ioe) {
                        indent -= 2;
                        return FileVisitResult.CONTINUE;
                    }
            });
        }


or, you can even do something you have not been able to do via jar or those classes at java.util.zip, for example

        Files.copy(Paths.get(SRC), fs.getPath(DST), COPY_ATTRIBUTES);

which copies a file into a ZIP file with creationTime and lastAccessTime attributes, below is a sample output that shows the "Demo.java" entry that was copied into the ZIP file with above line.


/Demo.java

    creationTime    : Fri Apr 29 22:41:46 PDT 2011
    lastAccessTime  : Wed May 25 20:58:36 PDT 2011

    lastModifiedTime: Fri Apr 29 22:41:46 PDT 2011

    isRegularFile   : true

    isDirectory     : false

    isSymbolicLink  : false

    isOther         : false

    fileKey         : null

    size            : 26680

    compressedSize  : 4941

    crc             : c5f6eb5a

    method          : 8

More interesting and cool NIO2/ZIP filesystem usages can be found in
<JDK7_HOME>/demo/nio/zipfs/Demo.java


Sure, in order to make this work, you need to add the <JDK>/demo/nio/zipfs/zipfs.jar into your classpath or manually drop it into the lib/ext directory, if it's not there already.

The more test cases and sample code we wrote the more we are convinced that it might be a good idea to simply deploy this ZIP file system provider into the system extensions directory, so the provider can be used directly (without playing with the -classpath to add the zipfs.jar into your classpath) to access a ZIP/Jar file via the NIO2 file system APIs, as an alternative to the existing java.util.zip/jar.ZipFile class. So since JDK7/b123, the zipfs.jar has been deployed into the lib/ext. You now can use the ZIP filesystem "out of the box" and access a ZIP/Jar file just like access a "normal" file system.

As of JDK7, the ZIP file system provider supports the legacy JAR URL syntax as defined by java.net.JarURLConnection. That is, entries in the zip/JAR file system will be identified by URIs of the form:
   jar:{uri}!/{entry}

In addition, A ZIP file system can be created using URIs of the form:
  jar:{uri}

The legacy JAR URL syntax will be a challenge to the platform once the latest URI RFE is adopted. We have decided to ignore this issue for now, as an alternative URI syntax would be confusing to developers and would be inconsistent with the rest of the platform.
For JDK 7, the zip/JAR file also must be located on the file system and so the URI scheme of "{uri}" in the above will be "file", eg:
  jar:file:/tmp/foo.jar
  jar:file:/tmp/foo.jar!/bar

When creating a new FileSystem properties may be used to configure the file system. These properties are specified to the FileSystems.newFileSystem methods via a java.util.Map. The key is the property name (String), and the value is the property value. The following 2 properties are now supported:

"create" : The value is of type java.lang.String with a default value of "false". If the value is "true" (case sensitive) then the zip or JAR file is created if it doesn't already exist.

"encoding" : The value is of type java.lang.String with a default value of "UTF-8". The value of the property indicates the encoding of the names of the entries in the Zip or JAR file.


You may want to give it a try if you happen to have some zips, jars around and let us know what works and what need improve.



Friday May 01, 2009

Non-UTF-8 encoding in ZIP file

The Zip specification (historically) does not specify what character encoding to be used for the embedded file names and comments, the original IBM PC character encoding set, commonly referred to as IBM Code Page 437, is supposed to be the only encoding supported. Jar specification meanwhile explicitly specifies to use UTF-8 as the encoding to encode and decode all file names and comments in Jar files. Our java.util.jar and java.util.zip implementation therefor strictly followed Jar specification to use UTF-8 as the sole encoding when dealing with the file names and comments stored in Jar/Zip files.

Consequence? the ZIP file created by "traditional" ZIP tool is not accessible for java.util.jar/zip based tool, and vice versa, if the file name contains characters that are not compatible between Cp437 (as an alternative, tools might simply use the default platform encoding) and UTF-8

For most European, you're "lucky":-) that you only need to avoid a "handful" of characters, such as the umlauts (OK, I'm just kidding), but for Japanese and Chinese, most of the characters are simply out of luck. This is why bug 4244499 had been the No.1 on the Top 25 Java Bugs for so many years. The bug is no longer on the list:-) it has been finally "fixed" in OpenJDK 7, b57. I still keep a snapshot as the record/kudo for myself:-)

The solution (I would use "solution" than a "fix") in JDK7 b57 is to introduce in a new set of ZipInputStream ZipOutStream and ZipFile constructors with a specific "charset" as the parameter, as showed below.

ZipFile(File, Charset)
ZipInputStream(InputStream, Charset)
ZipOutputStream(OutputStream, Charset)

With these new constructors, applications can now access those non-UTF-8 ZIP files via ZipInputStream or ZipFile objects created with the specific encoding, or create a Zip files encoded in non-UTF-8 via the new ZipOutputStream(os, charset) constructor, if necessary.

zip is a stripped-down version of the Jar tool with a "-encoding" option to support non-UTF8 encoding for entry name and comment, it can serve as a demo for how to use the new APIs (I used it as a unit test). I'm still debating with myself if it is a good idea to officially introduce "-encoding" into the Jar tool...

Something you might want to keep in mind when use these new APIs and the new JDK7 bundles.


(1)The java.util.jar package is not touched, therefor there is no behavior change when accessing Jar and ZIP file via java.util.jar package (Jar is Jar, it uses UTF-8)

(2)UTF-8 is still used to decode the file names and comments if the general purpose flag bit 11 (EFS) is ON, even if a non-UTF-8 charset is specified in constructors. (See PKWare ZIP Spec for more detailed info regarding EFS).

(3)Jar and ZIP file created by JDK7 b57 and later now set the "general purpose flag bit 11" if UTF-8 encoding is used to encode the file name and comment.

(4)Since JDK7 b57 we switched to use the "standard" UTF-8 charset in java.util.jar/zip implementation, the earlier Java releases use a "modified" version of UTF-8. This is an in-compatible change for sure, but I strongly believe this is something worth doing.

Enjoy the APIs! Leave me a comment if you have any question, issue or problem.

Friday Apr 17, 2009

ZIP64, The Format for > 4G Zipfile, Is Now Supported

We heard you! finally:-)

The support for ZIP64, the format for > 4G ZIP file, has finally been added into the latest OpenJDK7 build(b55). This RFE (request for enhancements) had been buried in the 200+ Jar/ZIP bug/rfe pile so deep that I was not even aware of / remembered its existence until a recent call from A custom strongly asking for it. Given everyone now has 200G+ disk space (and yes, most of my kid's video clips are now around 1G after I got that new camcorder), it is relatively easy for the Jar/ZIP user to run into this 4G ceiling these days. The RFE was quickly climbing itself to the top of my to-do list and is now in b55. From now on only the sky is the limit for your Jar/ZIP file:-)

So if you have > 4G stuff (either the total size of files zipped in > 4G or the individual files themselves are > 4G) to jar/zip, try out the OpenJDK7-b55 (and later), either via the java.util.jar/zip  APIs or the Jar tool. Let me know if you find anything not working or broken, I hope it's perfect though:-)

Here is the brief background info regarding the 4G size problem in Jar and ZIP file.

(1)Various size and position offset related fields in original ZIP format are 4 bytes, so by nature ZIP has this 4G size limitation.

(2)The field for total number of files zipped/stored in ZIP's Central directory record is 2 bytes, so it has the 65536 limit (Java's ZIP file implementation has some hacky code to workaround this issue though)

(3)ZIP64 format extensions was introduced in (in spec 4.5) to address above size limitation.

(4)JDK7 now fully supports the ZIP64(tm) format extensions
defined by the PKWARE's  ZIP specification

If you are interested in source code, here is my ZIP64 "to-do" list (a copy/paste note from the spec) and the code diffs.

About

xuemingshen

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today