Curly quotes from Word in UTF-8 using mb_detect_encoding
When detecting the encoding of some text from Word (saved as a CSV file) using...
$encoding = mb_detect_encoding($value, 'WINDOWS-1252, ISO-8859-1', true);
$value = iconv($encoding, 'UTF-8//IGNORE', $value);
If a string has curly quotes, $encoding will be set to ISO-8859-1 rather than WINDOWS-1252, which it should be, so the string reads "self-motivated" with funny boxes around it rather than “self-motivated” in its UTF-8 encoding.
Any ideas on how to resolve this other than replacing the curly quotes, because this could affect other characters too?
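Windows-1252 and ISO-8859-1 only differ in bytes 0x80 to 0x9F. In the former, most of these bytes map to printable characters (curly quotes, dashes, the euro sign and so on); in the latter, that range holds no printable characters. If you know your encoding is either Windows-1252 or ISO-8859-1, you can determine which it is by the presence of such bytes. If no such bytes are included, and you know it is one of these two encodings, you can convert from either, since they are identical everywhere else.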
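The byte-range check described above could be sketched roughly as follows. This is only a minimal sketch, assuming the input is already known to be one of these two encodings and that the mbstring extension is available; `convertLatinToUtf8` is a hypothetical helper name, not something from PHP or from the question.

```php
<?php
// Hypothetical helper: assumes $value is either Windows-1252 or ISO-8859-1.
function convertLatinToUtf8(string $value): string
{
    // Without the /u modifier, PCRE matches raw bytes, so this looks for
    // any byte in the range 0x80-0x9F, where the two encodings differ.
    $hasCp1252Bytes = preg_match('/[\x80-\x9F]/', $value) === 1;

    // If such bytes are present, treat the input as Windows-1252;
    // otherwise both encodings give the same result, so either works.
    $from = $hasCp1252Bytes ? 'Windows-1252' : 'ISO-8859-1';

    return mb_convert_encoding($value, 'UTF-8', $from);
}
```

For example, `convertLatinToUtf8("\x93self-motivated\x94")` should yield the string wrapped in proper curly quotes, while a string containing only bytes such as 0xE9 (é) converts identically under either encoding.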
I once created a function to convert almost everything to UTF-8. It also has some content-sniffing functionality inside, so maybe it helps you:
http://php.net/manual/function.utf8-encode.php#102382