Unzip files created with WinZIP with I18N file names?

2022-12-12 21:31 Q&A author:

People these days create their ZIP archives with WinZIP, which allows internationalized (i.e. non-Latin: Cyrillic, Greek, Chinese, you name it) file names.

Sadly, trying to unpack such a file causes trouble: UNIX unzip creates garbage-named files and directories like "®£¤ ©¤¥èì". Java and its jar command fail miserably on such archives.

Is there a passable way to unpack such files programmatically? UNIX or Java.

DotNetZip supports unicode and arbitrary encodings for filenames within zipfiles, either for reading or writing zips.

It's a .NET library. To use it on Unix, you would need Mono as a prerequisite.

If the zipfile is correctly constructed by WinZip, in other words if it complies with the zip spec from PKWare, then you don't need to do any special work to specify the encoding when you unpack it. According to the zip spec, there are two supported encodings for filenames in zipfiles: UTF-8 and IBM437. Which of the two is in use is recorded in the zip metadata, and any zip library can detect and use it. DotNetZip detects it automatically when reading a compliant zip, like this:

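using Ionic.Zip;  // DotNetZip namespace

// The filename encoding (UTF-8 or IBM437) is detected from the zip metadata.
using (var zip = ZipFile.Read("thearchive.zip"))
{
    foreach (var e in zip)
    {
        // e.FileName refers to the name on the entry
        e.Extract("extract-directory");
    }
}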
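For the curious, the "zip metadata" mentioned above is bit 11 of the general purpose bit flag in each entry's header: when it is set, the entry name is UTF-8. Here is a minimal Java sketch that inspects that bit in the first local file header of an archive. The class and method names are hypothetical (they are not part of DotNetZip or any library), and the sketch assumes the data begins with a local file header, which will not hold for an empty or self-extracting archive.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper for illustration; not part of any zip library. */
final class ZipNameFlag {
    private static final int LOCAL_HEADER_SIGNATURE = 0x04034b50; // bytes "PK\3\4" read little-endian
    private static final int UTF8_NAME_FLAG = 1 << 11;            // general purpose bit 11

    /** Returns true if the first local file header marks its entry name as UTF-8. */
    static boolean firstEntryUsesUtf8(byte[] zipBytes) {
        if (zipBytes.length < 8) {
            throw new IllegalArgumentException("too short to hold a local file header");
        }
        ByteBuffer buf = ByteBuffer.wrap(zipBytes).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(0) != LOCAL_HEADER_SIGNATURE) {
            throw new IllegalArgumentException("does not start with a local file header");
        }
        // Offset 4 holds "version needed" (2 bytes); offset 6 holds the flag word (2 bytes).
        int flags = buf.getShort(6) & 0xFFFF;
        return (flags & UTF8_NAME_FLAG) != 0;
    }
}
```

If the bit is clear, a compliant archive uses IBM437, but as explained below, many real-world archives use some other code page without saying so.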

There are archive programs that produce zips that are "non compliant" w.r.t. encoding. WinRAR is one: it will create a zip whose filenames are encoded in the default code page of the computer that made it. In Shanghai that will typically be cp936, in Taipei cp950, in Iceland something else, and in Lisbon something else again. The advantage of "non compliance" here is that Windows Explorer will open and correctly display i18n-ized filenames in such zips. In other words, "non compliance" is often what people want, because Windows doesn't (yet?) support UTF-8 zip files.

(This all has to do with the encoding used in the zipfile, not the encoding used in the files contained in the zip file)

The zip spec doesn't allow for the specification of an arbitrary text encoding in the zip metadata. In other words if you use cp950 when creating the zip, then your extract logic needs to "know" to use cp950 when extracting - nothing in the zip file carries that information. In addition, of course, the zip library you use to programmatically extract must support arbitrary encodings. As far as I know, Java's zip library does not. DotNetZip does. Like so:

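// zipToExtract and extractDirectory are string paths supplied by the caller.
// 950 is the Windows code page the filenames were written in.
using (ZipFile zip = ZipFile.Read(zipToExtract,
                                  System.Text.Encoding.GetEncoding(950)))
{
    foreach (ZipEntry e in zip)
    {
        e.Extract(extractDirectory);
    }
}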
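On the Java side, the statement above reflects the situation when this answer was written. Since Java 7, java.util.zip.ZipInputStream and java.util.zip.ZipFile have constructors that accept a Charset for entry names. A minimal sketch of the same idea, assuming you already know which code page the archive was created with (the class and method names here are made up for illustration):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper for illustration. */
final class LegacyZipNames {
    /**
     * Lists entry names, decoding them with nameCharset.
     * Entries flagged as UTF-8 in the archive are still decoded as UTF-8.
     */
    static List<String> listNames(InputStream zipData, Charset nameCharset) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(zipData, nameCharset)) {
            for (ZipEntry entry = zin.getNextEntry(); entry != null; entry = zin.getNextEntry()) {
                names.add(entry.getName());
            }
        }
        return names;
    }
}
```

For a cp950 archive you would pass something like Charset.forName("Big5"), provided your JDK ships that charset. As with DotNetZip, nothing in the file tells you which charset to pick, so the caller has to know or guess.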

DotNetZip can also create zip files with arbitrary encodings - "non compliant" zips.

DotNetZip is free, and open source.

The solution I've found: Apache commons-compress can unzip such archives just fine, if supplied with the correct fallback charset.
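To illustrate what a "fallback charset" means in general, here is a small stdlib-only Java sketch. It is not how commons-compress works internally, just a heuristic under stated assumptions: given the raw bytes of an entry name (however you obtain them), try strict UTF-8 first and use the fallback only when the bytes are not valid UTF-8. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; a heuristic, not a library API. */
final class NameDecoding {
    /** Decodes rawName as UTF-8 if it is well-formed, otherwise with the fallback charset. */
    static String decodeName(byte[] rawName, Charset fallback) {
        try {
            // REPORT makes the decoder throw on invalid input instead of substituting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(rawName))
                    .toString();
        } catch (CharacterCodingException notUtf8) {
            return new String(rawName, fallback);
        }
    }
}
```

Be aware that a legacy-encoded name can happen to be valid UTF-8 by accident, so a guess like this can still be wrong; knowing the charset the archive was created with remains the reliable route.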
