fgetcsv() drops characters with diacritics (i.e. non-ASCII) - how to fix?
Similar questions:
"Some characters in CSV file are not read during PHP fgetcsv()"; "fgetcsv() ignores special characters when they are at the beginning of line"
My application has a form where users can upload a CSV file (its 5 internal users have always uploaded a valid file: comma-delimited, quoted, records ending with LF), and the file is then imported into a database using PHP:
$fhandle = fopen($uploaded_file,'r');
while($row = fgetcsv($fhandle, 0, ',', '"', '\\')) {
print_r($row);
// further code not relevant as the data is already corrupt at this point
}
For reasons I cannot change, the users are uploading the file encoded in the Windows-1250 charset - a single-byte, 8-bit character encoding.
The problem: some (not all!) characters beyond 127 ("extended ASCII") are dropped in fgetcsv(). Example data:
"15","Ústav"
"420","Špičák"
"7","Tmaň"
becomes
Array (
0 => 15
1 => "stav"
)
Array (
0 => 420
1 => "pičák"
)
Array (
0 => 7
1 => "Tma"
)
(Note that č is kept, but Ú is dropped)
The documentation for fgetcsv says that "since 4.3.5 fgetcsv() is now binary safe", but it looks like it isn't. Am I doing something wrong, or is this function broken and should I look for a different way to parse CSV?
It turns out that I didn't read the documentation well enough - fgetcsv() is only somewhat binary-safe. It is safe for plain ASCII (bytes 0-127), but the documentation also says:
Note:
Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function
In other words, fgetcsv() tries to be binary-safe, but it actually isn't: it also interprets the bytes according to the locale's charset, so it will probably mangle data in any other encoding. Note that this setting is not configured in php.ini, but rather read from the $LANG environment variable.
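To see which locale the parser is working under, and to try switching it, something like the following sketch could help. The locale names in $candidates are assumptions: which ones exist depends entirely on the operating system. I have not verified that changing the locale cures the dropped characters on every PHP version.

```php
<?php
// Passing '0' as the locale makes setlocale() report the current
// setting without changing it.
$current = setlocale(LC_CTYPE, '0');
echo "LC_CTYPE before: $current\n";

// Assumption: candidate single-byte locales for Central European data.
// setlocale() tries each name in turn and returns the first one that
// the system accepts, or false if none is available.
$candidates = ['cs_CZ.CP1250', 'cs_CZ.ISO-8859-2', 'C'];
$applied = setlocale(LC_CTYPE, $candidates);

if ($applied === false) {
    echo "None of the candidate locales is installed\n";
} else {
    echo "LC_CTYPE now: $applied\n";
}
```

Only LC_CTYPE is changed here, so number and date formatting elsewhere in the application should stay as they were.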
I've sidestepped the issue by reading the lines with fgets (which works on bytes, not characters) and using a CSV function from a comment in the docs to parse them into an array:
$fhandle = fopen($uploaded_file,'r');
// compare against false explicitly, so that a final line containing just "0" is not skipped
while(($raw_row = fgets($fhandle)) !== false) { // fgets is actually binary safe
$row = csvstring_to_array($raw_row, ',', '"', "\n");
// $row is now read correctly
}
fclose($fhandle);
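A variation on the same workaround, as a rough sketch: convert each raw line from Windows-1250 to UTF-8 with iconv() and parse it with the built-in str_getcsv() instead of a helper copied from the docs comments. The function name read_cp1250_csv is hypothetical. It assumes that the iconv extension knows the WINDOWS-1250 charset, that the locale is a UTF-8 one (as in the documentation's example) so the converted data matches it, and that no quoted field contains a line break, since the file is read line by line.

```php
<?php
// Hypothetical helper (not part of PHP): reads a Windows-1250 encoded
// CSV file and returns its rows with all fields converted to UTF-8.
function read_cp1250_csv(string $path): array
{
    $fhandle = fopen($path, 'r');
    if ($fhandle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $rows = [];
    while (($raw_row = fgets($fhandle)) !== false) {
        // fgets() returns raw bytes; strip the line ending first
        $raw_row = rtrim($raw_row, "\r\n");
        if ($raw_row === '') {
            continue; // skip blank lines
        }
        // Convert before any CSV parsing happens
        $utf8_row = iconv('WINDOWS-1250', 'UTF-8', $raw_row);
        if ($utf8_row === false) {
            throw new RuntimeException('Line is not valid Windows-1250');
        }
        // Same delimiter, enclosure and escape as in the question
        $rows[] = str_getcsv($utf8_row, ',', '"', '\\');
    }
    fclose($fhandle);

    return $rows;
}
```

Calling read_cp1250_csv($uploaded_file) would then give one array per record, already in UTF-8, which is probably what the database import wants anyway.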