Using the DotNetZip library to unzip a file with non-ASCII characters

I'm trying to unzip a file, using DotNetZip Library.

The file contains folders and files with Danish characters (æøåÆØÅ).

Total Commander, 7-Zip, and Windows' own zip tool all extract the files correctly, but the DotNetZip library mangles the Danish characters.

Example: File_æøåÆØÅ.txt becomes File_æ¢åÆ¥Å.txt

Instead of an ø it contains a ¢; instead of an Ø it contains a ¥.

Code:

using (var zipFile = ZipFile.Read(@"File_æøåÆØÅ.zip"))
{
     zipFile.ExtractAll(@"File_æøåÆØÅ",
                        ExtractExistingFileAction.OverwriteSilently);
}

I'm using the default encoding (the "da-DK" culture), and I have also tried other encodings such as UTF-8.

How can I unzip a file containing filenames with Danish characters?


To handle that zip file, explicitly specify the Danish code page when reading the zip:

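// "da-DK" is a culture name, not an encoding name, so GetEncoding("da-DK") throws.
// The OEM code page for the da-DK culture is 850 ("ibm850").
var encoding = System.Text.Encoding.GetEncoding("ibm850");
using (var zipFile = ZipFile.Read(@"File_æøåÆØÅ.zip", encoding))
{
    zipFile.ExtractAll(@"File_æøåÆØÅ",
                       ExtractExistingFileAction.OverwriteSilently);
}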

The reason you need to do this explicitly:
The zip spec allows two text encodings for filenames and comments within a zip file: IBM437 and UTF-8. When either of these compliant encodings is used, the zip file metadata explicitly specifies it. DotNetZip, or any library, can confidently use the encoding that is specified in the zip file.
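To make that concrete: in the zip format, the UTF-8 marker is bit 11 (0x0800) of the "general purpose bit flag" field in each entry header. The following is only a minimal sketch, not part of DotNetZip; the class and method names are made up for illustration, and it looks only at the first local file header of an ordinary, non-spanned archive.

```csharp
using System;
using System.IO;
using System.Text;

class ZipUtf8FlagCheck
{
    // Hypothetical helper: returns true if the first local file header
    // has general purpose bit 11 set, meaning names are stored as UTF-8.
    static bool FirstEntryDeclaresUtf8(Stream zip)
    {
        // BinaryReader reads little-endian values, which matches the zip format.
        using (var reader = new BinaryReader(zip, Encoding.ASCII, leaveOpen: true))
        {
            uint signature = reader.ReadUInt32();   // local file header: 0x04034b50
            if (signature != 0x04034b50)
                throw new InvalidDataException("Stream does not start with a local file header.");

            reader.ReadUInt16();                    // skip "version needed to extract"
            ushort flags = reader.ReadUInt16();     // general purpose bit flag
            return (flags & 0x0800) != 0;           // bit 11: UTF-8 names and comments
        }
    }

    static void Main(string[] args)
    {
        // args[0] is the path to the zip file to inspect.
        using (FileStream fs = File.OpenRead(args[0]))
        {
            Console.WriteLine(FirstEntryDeclaresUtf8(fs)
                ? "First entry declares UTF-8 names."
                : "No UTF-8 flag: the spec default is IBM437, but the real encoding is unknown.");
        }
    }
}
```

If the flag is not set, the file itself gives no further hint about the code page, which is exactly the situation in the question.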

There is no way for a zip file to specify an encoding that is not one of those two. The zip spec does not provide a way to do this. Some zip libraries or tools construct zip files that do not comply with the spec in this regard; the zip files use text encodings like the Danish OEM code page (IBM850) or CP950 or whatever. Strictly speaking they are not compliant with the spec, but the tools still construct them. Zip files like this are not uncommon.

In cases like this some libraries or tools guess that the encoding used in the zip file is the same as the default encoding on the machine. This isn't safe, or guaranteed to work, but it is an assumption that works in the small case - where the zip file was created by a non-compliant library or tool on the local machine. If you construct a zip file with the default (non-compliant) text encoding, and then send it from Stockholm to Shanghai, using the "assume the default encoding" strategy at the time of reading will fail.

DotNetZip does not make the assumption. In cases where the zipfile uses a non-compliant text encoding, there's no indication in the zipfile about which encoding is used, therefore DotNetZip uses the standard encoding - IBM437 - to read the file. There's no way for DNZ to know that this is "wrong". If you want to override that behavior you need to use the ZipFile.Read() method that accepts a different encoding.

This is all described in the DotNetZip documentation, specifically in the ZipFile.ProvisionalAlternateEncoding property.
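The mangling in the question can be reproduced without any zip library. This is only a sketch, and it assumes the archive stored its names in code page 850 (the file itself does not say so): the same bytes decoded as IBM437 give the mangled name from the question.

```csharp
using System;
using System.Text;

class MojibakeDemo
{
    static void Main()
    {
        // On .NET Core / .NET 5+ the legacy code pages must be registered first.
        // (CodePagesEncodingProvider is also available on .NET Framework 4.6+.)
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding cp850 = Encoding.GetEncoding(850);  // assumed encoding of the archive
        Encoding cp437 = Encoding.GetEncoding(437);  // what the zip spec prescribes by default

        string original = "File_æøåÆØÅ.txt";
        byte[] raw = cp850.GetBytes(original);       // the bytes as stored in the zip

        string misread = cp437.GetString(raw);       // decoding with the wrong code page
        string correct = cp850.GetString(raw);       // decoding with the right one

        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(misread);                  // File_æ¢åÆ¥Å.txt
        Console.WriteLine(correct == original);      // True
    }
}
```

Only ø and Ø change because the two code pages disagree on just those byte values (0x9B and 0x9D) among the Danish letters; æ, å, Æ and Å sit at the same positions in both.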


I read the archive from a FileStream, and as far as I remember that worked (DotNetZip v1.9). Code for reading:

using (FileStream fs = File.OpenRead(filePath))
using (ZipFile zf = ZipFile.Read(fs)) // dispose the archive together with the stream
{
     foreach (ZipEntry entry in zf.Entries)
     {
          string path = entry.FileName; // entry name as decoded by DotNetZip
     }
}

And for making a zip archive:

ZipFile zip = new ZipFile(Encoding.UTF8);


First of all, overriding DotNetZip's default encoding using

zip.AlternateEncodingUsage = ZipOption.Always;

is dangerous because it always overrides the encoding, regardless of what the zip file actually uses. I myself have used

zip.AlternateEncoding = System.Text.Encoding.UTF8;
zip.AlternateEncodingUsage = ZipOption.AsNecessary;

That way UTF-8 is used only when needed.

As for the code page discussion: I fixed this within DotNetZip itself (in my local copy) by changing the default code page from "ibm437" to "ibm861".

I used 7-Zip and the Windows zipping tool to create test zip files with the special character 'ø' in a filename. Based on my test results, the default encoding used by Windows and 7-Zip is "ibm861", not "ibm437" as most documentation states.

The fix can be applied by searching the DotNetZip source for the string "ibm437" and replacing it with "ibm861".

Here is where I found some mention of the encoding code page: http://www.nudoq.org/#!/Packages/DotNetZip/Ionic.Zip/ZipInputStream/P/ProvisionalAlternateEncoding


This sounds simply like a bug in "DotNetZip" - have you tried SharpZipLib or ZipPackage (in the BCL)? Encoding typically relates to the file contents, not the name; so this shouldn't be a factor.

You should report this (with an example) to the author.


I had a similar problem with unzipping. The zip file my app has to read contains Eastern European characters like šđčćž. WinRAR and 7-Zip unzipped it correctly, but with the DNZ library (IonicZip 1.9.1.8) I got µ instead of š.

I tried some 15 different encodings before finally finding out that the zip file uses ibm852. This code sample worked for me:

ZipFile zf = new ZipFile(path, System.Text.Encoding.GetEncoding("ibm852"));
zf.ExtractAll(loc, ExtractExistingFileAction.OverwriteSilently);

Setting the AlternateEncoding property, as in the following snippet, didn't help:

using (ZipFile zz = ZipFile.Read(path))
{
    zz.AlternateEncodingUsage = ZipOption.Always;
    zz.AlternateEncoding = System.Text.Encoding.GetEncoding("ibm852");
    zz.ExtractAll(loc, ExtractExistingFileAction.OverwriteSilently);
}

I don't have time to investigate why. You probably have to set the encoding when calling the constructor, because I didn't find an encoding parameter on the Read method.
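A possible explanation, offered as a guess rather than something verified against the DotNetZip source: the entry names are decoded while the archive's directory is being read, so an encoding assigned after ZipFile.Read() has returned comes too late. If I recall the 1.9.x API correctly, the static Read method takes a ReadOptions object with an Encoding property for this purpose; treat that as an assumption and check it against the version you have. The helper name and its parameters below are made up for illustration.

```csharp
using System.Text;
using Ionic.Zip;

static class ZipExtractExample
{
    // Hypothetical helper; codePageName would be "ibm852" for the archive described above.
    public static void ExtractWithCodePage(string path, string targetDir, string codePageName)
    {
        // Assumption: DotNetZip 1.9.x provides ReadOptions.Encoding, which is applied
        // while the archive is being read rather than afterwards.
        var options = new ReadOptions { Encoding = Encoding.GetEncoding(codePageName) };

        using (ZipFile zip = ZipFile.Read(path, options))
        {
            zip.ExtractAll(targetDir, ExtractExistingFileAction.OverwriteSilently);
        }
    }
}
```

Either way, the encoding is supplied before the entry names are read, which is what the constructor overload shown earlier also does.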
