Non-UTF8 files (Google CSV file)
I'm running into weird encoding issues when handling uploaded files.
I need to accept any sort of text file, and be able to read the contents. Specifically having trouble with files downloaded from a Google Contacts export.
I've done the usual utf8_encode/decode, mb_detect_encoding, etc. Always returns as if the string is UTF-8, and tried many iconv options to try and revert encoding, but unsuccessful.
test.php
header('Content-type: text/html; charset=UTF-8');
if ($stream = fopen($_FILES['list']['tmp_name'], 'r'))
{
$string = stream_get_contents($stream);
f开发者_C百科close($stream);
}
echo substr($string, 0, 50);
var_dump(substr($string, 0, 50));
echo base64_encode(serialize(substr($string, 0, 50)));
Output
��N�a�m�e�,�G�i�v�e�n� �N�a�m�e�,�A�d�d�i�t�i�o�n�
��N�a�m�e�,�G�i�v�e�n� �N�a�m�e�,�A�d�d�i�t�i�o�n�
czo1MDoi//5OAGEAbQBlACwARwBpAHYAZQBuACAATgBhAG0AZQAsAEEAZABkAGkAdABpAG8AbgAiOw==
The beginning of the string carries the bytes \xFF \xFE which represent the Byte Order Mark for UTF-16 Little Endian. All letters are actually two-byte sequences. Mostly a leading \0 followed by the ASCII character.
Printing them on the console will make the terminal client interpret the UTF-16 sequences correctly. But you need to manually decode it (best via iconv) to make the whole array displayable.
When I decoded the base64 piece, I saw a strange mixed string: s:50:"\xff\xfeN\x00a\x00m\x00e\x00,\x00G\x00i\x00v\x00e\x00n\x00 \x00N\x00a\x00m\x00e\x00,\x00A\x00d\x00d\x00i\x00t\x00i\x00o\x00n\x00"
. The part after the second :
is a 2-byte Unicode (UCS2) string enclosed in ASCII "
, while "s" and "50" are plain ASCII. That \ff\fe
piece is a byte-order mark of a UCS2 string. This is insane but parseable.
I suppose that you split the input string by :
, strip "
from beginning and end and try to decode each resulting string separately.
精彩评论