Wednesday Aug 10, 2011

Is ZipInput/OutputStream handling the data descriptor wrong for big entries?


A bug report came in recently suggesting that the new ZIP64 support in JDK7 might not be implemented correctly. The submitter quoted the "data descriptor" section of the ZIP spec:

      When compressing files, compressed and uncompressed sizes
      should be stored in ZIP64 format (as 8 byte values) when a
      files size exceeds 0xFFFFFFFF.   However ZIP64 format may be
      used regardless of the size of a file.  When extracting, if
      the zip64 extended information extra field is present for
      the file the compressed and uncompressed sizes will be 8
      byte values.

and suggested "This means the sizes are eight byte if there is a ZIP64 extended information extra field and four bytes if there is none.  This is not what java.util.zip implements, ZipOutputStream#writeEXT writes eight byte data if one of the sizes exceeds 0xFFFFFFFF but it never writes any ZIP64 extended information extra field.  This means conforming implementations will "think" the sizes are four bytes while in fact they are eight bytes."
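
To make the two possible encodings concrete, here is a minimal sketch of a data descriptor writer that behaves the way the submitter describes: 4-byte sizes normally, 8-byte sizes once either size exceeds 0xFFFFFFFF. The class and method names are hypothetical and this is not the JDK source; the sketch also assumes the optional 0x08074b50 signature is always written.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not the code in java.util.zip.
final class DataDescriptorSketch {
    static final int EXTSIG = 0x08074b50;      // "PK\007\010" data descriptor signature
    static final long MAX32 = 0xFFFFFFFFL;     // largest value a 4-byte field can hold

    // Encodes signature, crc, csize, size in little-endian order.
    // Result is 16 bytes with 4-byte sizes, 24 bytes with 8-byte sizes.
    static byte[] encode(long crc, long csize, long size) {
        // Assumption: "exceeds" is read as strictly greater than 0xFFFFFFFF.
        boolean wide = csize > MAX32 || size > MAX32;
        ByteBuffer b = ByteBuffer.allocate(wide ? 24 : 16).order(ByteOrder.LITTLE_ENDIAN);
        b.putInt(EXTSIG);
        b.putInt((int) crc);
        if (wide) {
            b.putLong(csize);
            b.putLong(size);
        } else {
            b.putInt((int) csize);
            b.putInt((int) size);
        }
        return b.array();
    }
}
```

Nothing inside the record itself says which of the two widths was used, which is why a reader has to decide from some other piece of information, and why the submitter looks to the ZIP64 extra field for that.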


The submitter did not leave us a valid email address, and the bug (#7073588) does not appear to have been pushed out to our external bug website yet. Since I also believe this might be of interest to anyone using the new ZIP64 support in JDK7's java.util.zip package, I am posting my evaluation here. The "spec" referred to below is the ZIP file format specification, APPNOTE.TXT, from PKWare.



The "compressed/uncompressed size" part of the local file header (loc) spec states

      If bit 3 of the general purpose bit flag is set,
      these fields are set to zero in the local header and the
      correct values are put in the data descriptor and
      in the central directory.  If an archive is in ZIP64 format
      and the value in this field is 0xFFFFFFFF, the size will be
      in the corresponding 8 byte ZIP64 extended information
      extra field.

and the ZIP64 Information Extra Field (0x0001) spec says

      The following is the layout of the zip64 extended
      information "extra" block. If one of the size or
      offset fields in the Local or Central directory
      record is too small to hold the required data,
      a Zip64 extended information record is created.
      The order of the fields in the zip64 extended
      information record is fixed, but the fields will
      only appear if the corresponding Local or Central
      directory record field is set to 0xFFFF or 0xFFFFFFFF.
      ...
      This entry in the Local header must include BOTH original
      and compressed file size fields.
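
As an illustration of the extra block quoted above, the following sketch walks a local header's extra field and pulls the two 8-byte sizes out of the 0x0001 block. It relies on the general extra field layout (a sequence of 2-byte id, 2-byte length, then data, all little-endian) and assumes that the block in a loc starts with the original size followed by the compressed size. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not the code in java.util.zip.
final class Zip64ExtraSketch {
    static final int ZIP64_ID = 0x0001;

    // Returns {size, csize} from a loc extra field, or null when there is
    // no ZIP64 block carrying both sizes (or the data is truncated).
    static long[] findLocSizes(byte[] extra) {
        ByteBuffer b = ByteBuffer.wrap(extra).order(ByteOrder.LITTLE_ENDIAN);
        while (b.remaining() >= 4) {
            int id = b.getShort() & 0xFFFF;
            int len = b.getShort() & 0xFFFF;
            if (len > b.remaining()) {
                return null;                    // truncated block
            }
            if (id == ZIP64_ID) {
                if (len < 16) {
                    return null;                // loc block must hold BOTH sizes
                }
                long size = b.getLong();
                long csize = b.getLong();
                return new long[] { size, csize };
            }
            b.position(b.position() + len);     // skip an unrelated block
        }
        return null;
    }

    // Builds a 20-byte ZIP64 block as it would appear in a loc (test helper).
    static byte[] locBlock(long size, long csize) {
        return ByteBuffer.allocate(20).order(ByteOrder.LITTLE_ENDIAN)
                .putShort((short) ZIP64_ID).putShort((short) 16)
                .putLong(size).putLong(csize).array();
    }
}
```

For example, `findLocSizes(locBlock(5000000000L, 4000000000L))` yields the two sizes back, while an extra field without a 0x0001 block yields null.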

The above spec appears to say three things here:

(1) If the loc size and csize are to be stored in the data descriptor (when bit 3 of the general purpose flag is set), these fields are set to ZERO.

(2) If this archive is in ZIP64 format (what does this really mean? One possible interpretation is that a ZIP64 extension appears in the "extra field" of this loc) AND these two fields are 0xFFFFFFFF, then the corresponding size/csize can be found in the ZIP64 extension in the extra field.

(3) For size and csize to appear in the ZIP64 extended info extra field, their corresponding fields in the loc MUST be 0xFFFFFFFF. This means that if csize/size have to be present in the loc's ZIP64 extra field (because they exceed 4G), the size/csize fields in this loc MUST be 0xFFFFFFFF.

Here is the problem: if bit 3 of the general purpose flag is set, and therefore the size and csize fields in the loc MUST be ZERO, then (3) can NOT be true.
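
The conflict between (1) and (3) can be written down as two small predicates. This is only a sketch of my reading of the spec text, with hypothetical names; it shows that no pair of loc size values can satisfy both rules once bit 3 is set and the sizes are supposed to live in the ZIP64 extra field.

```java
// Hypothetical illustration of rules (1) and (3) as read from the spec text.
final class LocRulesSketch {
    static final int FLAG_DATA_DESCRIPTOR = 0x0008;   // general purpose flag, bit 3
    static final long ZIP64_MAGIC = 0xFFFFFFFFL;

    // Rule (1): with bit 3 set, the loc csize and size fields must be zero.
    static boolean satisfiesRule1(int flag, long csize, long size) {
        return (flag & FLAG_DATA_DESCRIPTOR) == 0 || (csize == 0 && size == 0);
    }

    // Rule (3): sizes stored in the ZIP64 extra field require 0xFFFFFFFF in the loc.
    static boolean satisfiesRule3(boolean sizesInZip64Extra, long csize, long size) {
        return !sizesInZip64Extra || (csize == ZIP64_MAGIC && size == ZIP64_MAGIC);
    }

    static boolean consistent(int flag, boolean sizesInZip64Extra, long csize, long size) {
        return satisfiesRule1(flag, csize, size)
            && satisfiesRule3(sizesInZip64Extra, csize, size);
    }
}
```

With bit 3 set and `sizesInZip64Extra` true, `consistent` is false for zero fields (rule (3) fails) and false for 0xFFFFFFFF fields (rule (1) fails), so the two features cannot be combined under this reading.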

And from an implementation point of view, the main reason we have the "data descriptor" is that you don't yet know the values of size and csize when writing the loc (as in the streaming case). It really does not make sense to have a ZIP64 extended info extra field as well, since it is part of the loc and you still don't know the size/csize values when writing it.

That said, this obviously contradicts what is specified in the extracting part of the "data descriptor" spec, as quoted,

      When compressing files, compressed and uncompressed sizes
      should be stored in ZIP64 format (as 8 byte values) when a
      files size exceeds 0xFFFFFFFF.   However ZIP64 format may be
      used regardless of the size of a file.  When extracting, if
      the zip64 extended information extra field is present for
      the file the compressed and uncompressed sizes will be 8
      byte values.

which says you CAN have a ZIP64 extended info extra field in a loc (size & csize must be 0xFFFFFFFF), even if bit 3 of the general purpose flag is set (size & csize must be 0).


I don't have an answer for this, so I sent an email to PKWare, the owner of the ZIP format specification, asking for clarification.

The clarification came back promptly as:

--------------------------------------------------------
Thank you for your interest in the ZIP format.  I reviewed the APPNOTE and I believe the documentation should be updated for more clarity on this.  I will log a ticket to get further clarification on this record into a future version of the APPNOTE.  

To address your question, you would not use the Data Descriptor (presence is signaled using bit 3) at the same time as the ZIP64 Extended Information Extra Field (which uses the 0xFFFFFFFF value and "Extra Field" 0x0001).  When using the Data Descriptor, the values would be written as ZERO.  When alternatively, the ZIP64 extended information extra field is used, the values should be 0xFFFFFFFF.

I hope this helps with your understanding.  Please let me know if there is any additional information I can provide to you on this topic.  
---------------------------------------------------------

It appears the suggestion is NOT to have both the Data Descriptor and the ZIP64 Extended Information Extra Field at the "same time". Our implementation already behaves that way, so I have closed this one as "not a defect" for now.
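
For anyone who wants to look at what ZipOutputStream writes in the streaming case, here is a small sketch that writes one entry into memory and reads the flag and size fields back from the first local header. The class and method names are hypothetical; the field offsets (flag at 6, csize at 18, size at 22) come from the loc layout in the spec.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for inspecting the first loc of a streamed archive.
final class StreamedLocSketch {
    // Returns {general purpose flag, loc csize field, loc size field}.
    static long[] firstLocFields() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            // No size/csize/crc set on the entry, so they are unknown at loc time.
            zos.putNextEntry(new ZipEntry("hello.txt"));
            zos.write("hello".getBytes(StandardCharsets.US_ASCII));
            zos.closeEntry();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        ByteBuffer b = ByteBuffer.wrap(bos.toByteArray()).order(ByteOrder.LITTLE_ENDIAN);
        if (b.getInt(0) != 0x04034b50) {
            throw new IllegalStateException("no loc signature at offset 0");
        }
        int flag = b.getShort(6) & 0xFFFF;
        long csize = b.getInt(18) & 0xFFFFFFFFL;
        long size = b.getInt(22) & 0xFFFFFFFFL;
        return new long[] { flag, csize, size };
    }
}
```

On the JDK builds I am aware of, bit 3 of the returned flag is set and both size fields are zero, which matches the "values would be written as ZERO" part of the clarification.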


About

xuemingshen
