Non-UTF8 files (Google CSV file)

2023-02-06 18:10

I'm running into weird encoding issues when handling uploaded files.

I need to accept any sort of text file and be able to read its contents. Specifically, I'm having trouble with files downloaded from a Google Contacts export.

I've tried the usual utf8_encode/utf8_decode, mb_detect_encoding, etc. Detection always reports the string as UTF-8. I've also tried many iconv options to convert the encoding, without success.

test.php

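<?php
header('Content-type: text/html; charset=UTF-8');

$string = ''; // stays empty if the upload cannot be opened

// Read the uploaded file's raw bytes
if ($stream = fopen($_FILES['list']['tmp_name'], 'r'))
{
    $string = stream_get_contents($stream);

    fclose($stream);
}

// Dump the first 50 bytes three ways to inspect the encoding
echo substr($string, 0, 50);
var_dump(substr($string, 0, 50));
echo base64_encode(serialize(substr($string, 0, 50)));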

Output

��N�a�m�e�,�G�i�v�e�n� �N�a�m�e�,�A�d�d�i�t�i�o�n�
��N�a�m�e�,�G�i�v�e�n� �N�a�m�e�,�A�d�d�i�t�i�o�n�
czo1MDoi//5OAGEAbQBlACwARwBpAHYAZQBuACAATgBhAG0AZQAsAEEAZABkAGkAdABpAG8AbgAiOw==

The beginning of the string carries the bytes \xFF \xFE, which are the Byte Order Mark for UTF-16 Little Endian. Every letter is actually a two-byte sequence: for ASCII characters, the character's byte followed by a \0 byte.
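
A minimal sketch of that BOM check, assuming the raw bytes are already in a PHP string. The helper name detect_bom_encoding and the sample bytes are made up for illustration and are not part of the original post:

```php
<?php
// Hypothetical helper (not from the original post): guess the encoding
// of a byte string from its leading byte-order mark, if it has one.
function detect_bom_encoding(string $bytes): ?string
{
    // Longer BOMs are checked first, because the UTF-32LE BOM starts
    // with the same two bytes as the UTF-16LE BOM.
    $boms = [
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFF\xFE"         => 'UTF-16LE',
        "\xFE\xFF"         => 'UTF-16BE',
    ];
    foreach ($boms as $bom => $encoding) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null; // no BOM found
}

// First bytes of the upload shown in the question.
$sample = "\xFF\xFEN\x00a\x00m\x00e\x00";
echo detect_bom_encoding($sample), "\n"; // UTF-16LE
```

A file without a BOM returns null here, so this only helps when the exporter writes one, as the Google Contacts export in the question does.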

Printed to a console, the text may look correct because the terminal client copes with the UTF-16 sequences itself. To make the whole string displayable in a browser, though, you need to decode it yourself (best via iconv) into UTF-8.
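
The iconv decoding could look roughly like this. It is a sketch that assumes the input is UTF-16LE with an optional BOM; the helper name utf16le_to_utf8 and the sample bytes are illustrative only:

```php
<?php
// Hypothetical helper: convert UTF-16LE bytes (with or without a BOM)
// to UTF-8 using iconv.
function utf16le_to_utf8(string $bytes): string
{
    // Drop the BOM so it does not survive as U+FEFF in the output.
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        $bytes = substr($bytes, 2);
    }
    $utf8 = iconv('UTF-16LE', 'UTF-8', $bytes);
    if ($utf8 === false) {
        throw new RuntimeException('Input is not valid UTF-16LE');
    }
    return $utf8;
}

// Start of the uploaded file from the question.
$sample = "\xFF\xFEN\x00a\x00m\x00e\x00,\x00G\x00i\x00v\x00e\x00n\x00";
echo utf16le_to_utf8($sample), "\n"; // Name,Given
```

In test.php, the same call would be applied to $string before echoing it, since the page declares charset=UTF-8.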

When I decoded the base64 piece, I saw a strange mixed string: s:50:"\xff\xfeN\x00a\x00m\x00e\x00,\x00G\x00i\x00v\x00e\x00n\x00 \x00N\x00a\x00m\x00e\x00,\x00A\x00d\x00d\x00i\x00t\x00i\x00o\x00n\x00". The part after the second : is a 2-byte Unicode (UCS-2) string enclosed in ASCII ", while "s" and "50" are plain ASCII; that wrapper is what serialize() in test.php emits for a 50-byte string. The \xff\xfe piece is the byte-order mark of a little-endian UCS-2 string. It looks odd, but it is parseable.

I suppose you could split the input string by :, strip the " from the beginning and end, and decode each resulting piece separately.
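
A rough sketch of that split-and-decode idea, run against the base64 text from the question's output. The variable names are illustrative, and the payload is assumed to be UTF-16LE because of its \xff\xfe BOM. PHP's own unserialize() would recover the same raw bytes in one call; the manual split is shown only to mirror the steps described above:

```php
<?php
// Base64 text copied from the question's output.
$encoded = 'czo1MDoi//5OAGEAbQBlACwARwBpAHYAZQBuACAATgBhAG0AZQAsAEEAZABkAGkAdABpAG8AbgAiOw==';
$serialized = base64_decode($encoded, true);

// Split into type, length, and payload. The limit of 3 keeps any ':'
// inside the payload intact.
[$type, $length, $payload] = explode(':', $serialized, 3);

// Skip the opening '"' and take exactly $length bytes, which leaves
// out the closing '";'.
$raw = substr($payload, 1, (int) $length);

// Remove the 2-byte BOM, then decode the remaining UTF-16LE bytes.
$text = iconv('UTF-16LE', 'UTF-8', substr($raw, 2));
echo $text, "\n"; // Name,Given Name,Addition
```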

Tags: character-encoding php utf-8
