Reading Zip files with non-ASCII file names
By Weijun on Mar 05, 2007
In a zip file, each entry occupies one slot, which holds a description (the entry header) and the compressed data. One field of the description is, of course, the file name. Unfortunately, the original ZIP specification does not say how the file name is encoded. Most applications use the default encoding, that is, the native encoding of the underlying operating system. Java considered i18n from the very beginning and has a strong wish to be cross-platform and cross-locale, so it chose UTF-8, since that's the only (?) charset that's guaranteed to be supported on every platform and can represent every character in the world.
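To make the mismatch concrete, here is a small sketch that encodes the same two-character Chinese name both ways. The sample name 中文 and the class name `NameEncodingDemo` are mine, chosen only for illustration, and the code assumes the runtime ships the GB2312 charset (a full JDK normally does).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NameEncodingDemo {
    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+4E2D U+6587, written with escapes so the source file stays ASCII.
        String name = "\u4e2d\u6587";
        // Assumption: GB2312 is available in this runtime.
        Charset gb2312 = Charset.forName("GB2312");

        byte[] asGb = name.getBytes(gb2312);
        byte[] asUtf8 = name.getBytes(StandardCharsets.UTF_8);
        System.out.println("GB2312: " + hex(asGb));   // D6 D0 CE C4
        System.out.println("UTF-8:  " + hex(asUtf8)); // E4 B8 AD E6 96 87

        // Decoding the GB2312 bytes as UTF-8 does not give the name back.
        String misread = new String(asGb, StandardCharsets.UTF_8);
        System.out.println(name.equals(misread));     // false
    }
}
```

The two byte sequences do not even have the same length, which is why a reader that assumes one encoding cannot make sense of names written in the other.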
So here comes the interoperability problem. If you compress files with Chinese names using WinRAR, the archive cannot be opened by the jar command. Another popular piece of software, GMail, also uses UTF-8 when you download several attachments at once; this time, the file you get cannot be opened by WinRAR (use jar instead, which comes only with the JDK).
I did not care about this until recently, when I started playing with Java ME. The first program I wanted to write shows what's on now on various TV stations. On the website of CCTV (China Central Television) there's a zipped schedule bundle covering almost all the TV channels I can receive. Each channel has one file in the zip bundle, and the file name is the channel's name, in Chinese, in the GB2312 encoding.
If I were writing a Java SE program, I wouldn't hesitate for a moment to go inside the JDK and change the lines where UTF-8 is forced, and that would be it. It's my own private copy of the JDK, so I can do anything I like to it. But for Java ME, I don't know of a way to replace the JRE on my phone with a customized one. (Maybe I can ask the ME guys on the floor below.)
This is what I did:
- Read the zip spec, but only the first 4 pages, which cover the overall layout and the structure of the entry header.
- Write a FilterInputStream and override the read method to translate the file name from the Chinese encoding (GB2312) to UTF-8 on the fly, updating the filename length field as well. The filter lets all other fields (and the compressed data) pass through untouched. After the last zip entry there are still quite a lot of blocks (starting with [archive decryption header], if you read the spec). I really don't understand what they are. Shouldn't any data after the zip entries be useless, at least in streaming mode? So I treat them as one big EOF and let read() return -1 when this part is reached.
- Insert this filter between the zip file's InputStream and the ZipInputStream. Everything is OK now.
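
The steps above can be sketched roughly as follows. This is not my original filter, just a minimal reconstruction under some loud assumptions. It is written against Java SE for readability, so a Java ME port would need its own charset handling. It trusts the compressed size stored in each local header, so it gives up on entries that use a trailing data descriptor (general purpose flag bit 3) and knows nothing about Zip64. It does not override skip() or mark(). The class name `NameTranscodingInputStream` and the helper `listNames` are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Sketch: rewrites local-header file names to UTF-8 while streaming a zip. */
class NameTranscodingInputStream extends FilterInputStream {
    private static final long LOCAL_HEADER_SIG = 0x04034b50L; // bytes 50 4B 03 04
    private static final int FIXED_HEADER_LEN = 30;

    private final Charset from;           // encoding the archive's names are in
    private byte[] pending = new byte[0]; // rewritten header not yet handed out
    private int pendingPos = 0;
    private long dataLeft = 0;            // entry data bytes still to pass through
    private boolean eof = false;

    NameTranscodingInputStream(InputStream in, Charset from) {
        super(in);
        this.from = from;
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        return read(one, 0, 1) < 0 ? -1 : one[0] & 0xFF;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (len == 0) return 0;
        if (pendingPos < pending.length) {        // drain the rewritten header
            int n = Math.min(len, pending.length - pendingPos);
            System.arraycopy(pending, pendingPos, b, off, n);
            pendingPos += n;
            return n;
        }
        if (dataLeft > 0) {                       // pass entry data through as-is
            int n = in.read(b, off, (int) Math.min(len, dataLeft));
            if (n > 0) dataLeft -= n;
            return n;
        }
        return nextHeader() ? read(b, off, len) : -1;
    }

    /** Reads one local header and queues a copy with a UTF-8 name; false means EOF. */
    private boolean nextHeader() throws IOException {
        byte[] h = new byte[FIXED_HEADER_LEN];
        // Anything that is not a local file header is treated as one big EOF.
        if (eof || !fill(h) || le32(h, 0) != LOCAL_HEADER_SIG) {
            eof = true;
            return false;
        }
        if ((le16(h, 6) & 0x08) != 0) {
            throw new IOException("data descriptor entries are not handled in this sketch");
        }
        byte[] oldName = new byte[le16(h, 26)];   // file name length field
        byte[] extra = new byte[le16(h, 28)];     // extra field length field
        if (!fill(oldName) || !fill(extra)) throw new IOException("truncated entry header");

        byte[] newName = new String(oldName, from).getBytes(StandardCharsets.UTF_8);
        if (newName.length > 0xFFFF) throw new IOException("transcoded name too long");
        h[26] = (byte) newName.length;            // update the length, little-endian
        h[27] = (byte) (newName.length >>> 8);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(h);
        out.write(newName);
        out.write(extra);
        pending = out.toByteArray();
        pendingPos = 0;
        dataLeft = le32(h, 18);                   // compressed size field
        return true;
    }

    /** Fills buf completely from the underlying stream; false if it ends early. */
    private boolean fill(byte[] buf) throws IOException {
        for (int off = 0; off < buf.length; ) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) return false;
            off += n;
        }
        return true;
    }

    private static int le16(byte[] b, int off) {
        return (b[off] & 0xFF) | ((b[off + 1] & 0xFF) << 8);
    }

    private static long le32(byte[] b, int off) {
        return le16(b, off) | ((long) le16(b, off + 2) << 16);
    }

    /** Hypothetical helper: lists entry names of a zip whose names use charset from. */
    static List<String> listNames(InputStream zip, Charset from) throws IOException {
        List<String> names = new ArrayList<>();
        ZipInputStream zin = new ZipInputStream(new NameTranscodingInputStream(zip, from));
        for (ZipEntry e; (e = zin.getNextEntry()) != null; ) names.add(e.getName());
        return names;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a zip whose entry names are GB2312-encoded.
        try (InputStream file = new FileInputStream(args[0])) {
            for (String name : listNames(file, Charset.forName("GB2312"))) System.out.println(name);
        }
    }
}
```

On Java SE 7 and later the constructor `ZipInputStream(InputStream, Charset)` handles this case directly, so a filter like this only matters where that constructor is missing, as on Java ME.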