Curly quotes from Word in UTF-8 using mb_detect_encoding

When detecting the encoding of some text from Word (saved as a CSV file) using...

$encoding = mb_detect_encoding($value, 'WINDOWS-1252, ISO-8859-1', true);
$value = iconv($encoding, 'UTF-8//IGNORE', $value);

If a string has curly quotes, $encoding is set to ISO-8859-1 rather than WINDOWS-1252, which is what it should be, so the string reads "self-motivated" with funny boxes around it instead of “self-motivated” in its UTF-8 encoding.

Any ideas on how to resolve this other than replacing the curly quotes? Replacing them is not enough, because the same problem could affect other characters too.


Windows-1252 and ISO-8859-1 differ only in bytes 0x80 to 0x9F. In Windows-1252 most of these bytes are printable characters (curly quotes, dashes, the euro sign); in ISO-8859-1 they are C1 control codes that ordinary text does not contain. If you know your encoding is either Windows-1252 or ISO-8859-1, you can determine which it is by the presence of such bytes. If no such bytes are present, and you know it is one of these two encodings, you can convert from either, since the result is the same.
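
The byte check described above could be sketched roughly like this. It is only an illustration, assuming the input really is one of the two encodings; the function names detectLatinEncoding and latinToUtf8 are made up for this example and are not part of PHP or of the original post.

```php
<?php
// Hypothetical helper: guess between the two encodings by looking for
// bytes 0x80-0x9F, which are printable only in Windows-1252.
function detectLatinEncoding(string $value): string
{
    // No /u modifier, so the pattern is matched byte by byte.
    return preg_match('/[\x80-\x9F]/', $value) === 1
        ? 'WINDOWS-1252'
        : 'ISO-8859-1';
}

// Hypothetical helper: convert to UTF-8 using the guessed encoding.
// Assumes $value is known to be Windows-1252 or ISO-8859-1, nothing else.
function latinToUtf8(string $value): string
{
    return iconv(detectLatinEncoding($value), 'UTF-8//IGNORE', $value);
}
```

With this, a value containing the Windows-1252 curly quote bytes 0x93 and 0x94 would be treated as WINDOWS-1252 and converted to real curly quotes, while a string without any byte in that range converts identically from either encoding.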


I once created a function to convert almost everything to UTF-8. It also has some content-sniffing functionality built in. Maybe this helps you?

http://php.net/manual/function.utf8-encode.php#102382
