开发者

Non-UTF8 files (Google CSV file)

I'm running into weird encoding issues when handling uploaded files.

I need to accept any sort of text file, and be able to read the contents. Specifically having trouble with files downloaded from a Google Contacts export.

I've done the usual utf8_encode/decode, mb_detect_encoding, etc. Always returns as if the string is UTF-8, and tried many iconv options to try and revert encoding, but unsuccessful.

test.php

header('Content-type: text/html; charset=UTF-8');

if ($stream = fopen($_FILES['list']['tmp_name'], 'r'))
{
    $string = stream_get_contents($stream);

    f开发者_C百科close($stream);
}

echo substr($string, 0, 50);
var_dump(substr($string, 0, 50));
echo base64_encode(serialize(substr($string, 0, 50)));

Output

��N�a�m�e�,�G�i�v�e�n� �N�a�m�e�,�A�d�d�i�t�i�o�n�
��N�a�m�e�,�G�i�v�e�n� �N�a�m�e�,�A�d�d�i�t�i�o�n�
czo1MDoi//5OAGEAbQBlACwARwBpAHYAZQBuACAATgBhAG0AZQAsAEEAZABkAGkAdABpAG8AbgAiOw==


The beginning of the string carries the bytes \xFF \xFE which represent the Byte Order Mark for UTF-16 Little Endian. All letters are actually two-byte sequences. Mostly a leading \0 followed by the ASCII character.

Printing them on the console will make the terminal client interpret the UTF-16 sequences correctly. But you need to manually decode it (best via iconv) to make the whole array displayable.


When I decoded the base64 piece, I saw a strange mixed string: s:50:"\xff\xfeN\x00a\x00m\x00e\x00,\x00G\x00i\x00v\x00e\x00n\x00 \x00N\x00a\x00m\x00e\x00,\x00A\x00d\x00d\x00i\x00t\x00i\x00o\x00n\x00". The part after the second : is a 2-byte Unicode (UCS2) string enclosed in ASCII ", while "s" and "50" are plain ASCII. That \ff\fe piece is a byte-order mark of a UCS2 string. This is insane but parseable.

I suppose that you split the input string by :, strip " from beginning and end and try to decode each resulting string separately.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜