User submitted CSV file upload UTF-8 concern

2023-02-27 03:43 问答作者：

I have a feature that uploads a user submitted CSV file into my database using fgetcsv etc.

My database has a collation of utf8_general_ci and the website charset is set to utf-8.

How can I ensure that when inserting the data from CSV into my database for display on the website, the correct encoding is set?

Do I have to test every string u开发者_Python百科sing something like mb_detect_encoding (seems a bit memory intensive) or can I just utf8_encode the whole string. Or should I not be worrying at all?

Auto-detecting the encoding of a user-submitted file is indeed extremely shaky.

Consider a manual approach:

Have the user upload the file.
In an iframe, show them a preview of how the data is going to be inserted. (like OpenOffice does when importing an unknown file into a spreadsheet). An illustration of that is here
Next to that, show a drop-down offering all relevant encodings.
If the user switches to a different encoding, update the preview on-the-fly using iconv():
```
$data = iconv($chosen_encoding, "utf-8", $data);
```
Once the user has confirmed that the data is displayed correctly in the selected encoding, do a final iconv() on the data and insert it into your database.

The downside of this is that the user needs to be educated about an issue that they're most likely ignorant of, and rightly not interested in. But it's the only way to make sure the data that enters the system is okay.

Re your comment:

I really want to make this transparent to the user. Would doing a utf8_encode on the string at least ensure the proper encoding is set regardless, or would it screw all of the data up?

utf8_encode is just a synonym for iconv("iso-8859-1", "utf-8", $data). If the incoming data is not ISO-8859-1, it will get screwed up by the process. It's a tricky issue.

If you need this to be transparent, you'll have to try your luck with mb_detect_encoding - on the full file unfortunately, because ISO-8859-1 and UTF-8 share the same set of base (ASCII) characters but differ in everything else like Umlauts ÄÖÜ.

Note that encoding detection is close to useless if files come in from all over the world (ie. could have any encoding)

继续阅读：character-encoding php utf-8

User submitted CSV file upload UTF-8 concern

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？