User-submitted CSV file upload: UTF-8 concern

I have a feature that uploads a user submitted CSV file into my database using fgetcsv etc.

My database has a collation of utf8_general_ci and the website charset is set to utf-8.

How can I ensure that when inserting the data from CSV into my database for display on the website, the correct encoding is set?

Do I have to test every string using something like mb_detect_encoding (which seems a bit memory-intensive), or can I just utf8_encode the whole string? Or should I not be worrying at all?


Auto-detecting the encoding of a user-submitted file is indeed extremely shaky.

Consider a manual approach:

  • Have the user upload the file.

  • In an iframe, show them a preview of how the data is going to be inserted, the way OpenOffice does when importing an unknown file into a spreadsheet. An illustration of that is here

  • Next to that, show a drop-down offering all relevant encodings.

  • If the user switches to a different encoding, update the preview on-the-fly using iconv():

    $data = iconv($chosen_encoding, "utf-8", $data);
    
  • Once the user has confirmed that the data is displayed correctly in the selected encoding, do a final iconv() on the data and insert it into your database.
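
The preview step above could be sketched roughly as follows. This is only an illustration under assumptions: the function name previewRows, the encoding whitelist, and the five-row limit are all hypothetical, and the raw upload is assumed to fit in memory as a string (for example from file_get_contents on the temporary upload).

```php
<?php
// Hypothetical whitelist of encodings offered in the drop-down.
// Never pass an arbitrary user-supplied string straight to iconv().
const PREVIEW_ENCODINGS = ['UTF-8', 'ISO-8859-1', 'ISO-8859-15', 'CP1252'];

/**
 * Convert the raw upload from the chosen encoding to UTF-8 and return
 * the first few CSV rows for the preview, or null if the encoding is
 * not on the whitelist or the bytes cannot be converted.
 */
function previewRows(string $raw, string $chosenEncoding, int $maxRows = 5): ?array
{
    if (!in_array($chosenEncoding, PREVIEW_ENCODINGS, true)) {
        return null;
    }

    // iconv() returns false (and raises a notice) on bytes that are
    // illegal in the source encoding; the notice is suppressed here.
    $utf8 = @iconv($chosenEncoding, 'UTF-8', $raw);
    if ($utf8 === false) {
        return null;
    }

    // Parse from an in-memory stream so fgetcsv() can be reused as-is.
    $stream = fopen('php://memory', 'r+');
    fwrite($stream, $utf8);
    rewind($stream);

    $rows = [];
    while (count($rows) < $maxRows
        && ($row = fgetcsv($stream, 0, ',', '"', '\\')) !== false) {
        $rows[] = $row;
    }
    fclose($stream);

    return $rows;
}
```

The same function can serve both the on-the-fly preview and the final import (with a larger row limit), so what the user confirmed is exactly what gets inserted.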

The downside of this is that the user needs to be educated about an issue that they're most likely ignorant of, and rightly not interested in. But it's the only way to make sure the data that enters the system is okay.

Re your comment:

I really want to make this transparent to the user. Would doing a utf8_encode on the string at least ensure the proper encoding is set regardless, or would it screw all of the data up?

utf8_encode does the same thing as iconv("iso-8859-1", "utf-8", $data): it assumes every input byte is ISO-8859-1. If the incoming data is not ISO-8859-1 (including data that is already UTF-8), it will get screwed up by the process. It's a tricky issue.
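
To make the damage concrete, here is a small sketch of what happens when data that is already UTF-8 goes through that conversion. It uses iconv() rather than utf8_encode() itself, since utf8_encode() is deprecated as of PHP 8.2; the variable names are just for illustration.

```php
<?php
// "Ä" already encoded as UTF-8 is the two bytes C3 84.
$alreadyUtf8 = "\xC3\x84";

// The ISO-8859-1 -> UTF-8 conversion treats each of those two bytes
// as a separate ISO-8859-1 character and encodes each one again.
$encodedTwice = iconv('ISO-8859-1', 'UTF-8', $alreadyUtf8);

echo bin2hex($alreadyUtf8), "\n";  // c384
echo bin2hex($encodedTwice), "\n"; // c383c284
```

The result is four bytes instead of two, which a browser renders as "Ã" followed by an invisible control character instead of "Ä".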

If you need this to be transparent, you'll have to try your luck with mb_detect_encoding. Unfortunately it has to run on the full file, because ISO-8859-1 and UTF-8 encode the base (ASCII) characters identically and differ only in everything else, like the umlauts ÄÖÜ, so a sample that happens to contain only ASCII tells you nothing.
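
One way the transparent route might look, as a sketch under stated assumptions: instead of asking mb_detect_encoding to pick among many candidates, it asks the narrower yes/no question "is the whole file valid UTF-8?" with mb_check_encoding, and otherwise assumes a single fallback encoding. The function name toUtf8 and the ISO-8859-1 fallback are assumptions, and the guess will be wrong for files in any other legacy encoding. It requires the mbstring extension.

```php
<?php
/**
 * Return $raw as UTF-8. Valid UTF-8 is passed through untouched;
 * anything else is ASSUMED to be in $fallbackEncoding and converted.
 */
function toUtf8(string $raw, string $fallbackEncoding = 'ISO-8859-1'): string
{
    // Drop a UTF-8 byte order mark, which spreadsheet exports often add.
    if (strncmp($raw, "\xEF\xBB\xBF", 3) === 0) {
        $raw = substr($raw, 3);
    }

    // Checks the whole string, so non-ASCII bytes late in the file count.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }

    return mb_convert_encoding($raw, 'UTF-8', $fallbackEncoding);
}
```

Non-ASCII text in a legacy single-byte encoding is rarely valid UTF-8 by accident, which is why the check is a reasonable first filter; what it cannot do is tell ISO-8859-1 apart from other single-byte encodings.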

Note that encoding detection is close to useless if files come in from all over the world (i.e. could be in any encoding).
