What is the best way to handle uploaded text files of different encodings?

Posted 2023-02-13 09:09

Internally our PHP application uses UTF-8, and we do processing on .csv files and fixed-width (text) files. We have written some nice libraries (classes, essentially) to work with these files.

We recently added the ability for administrators to upload files of these types so they could be processed, and quickly ran into issues across multiple operating systems. We soon realised that the files being read in used different encodings from our application (e.g. Windows-1252 or ISO-8859).

Since it is impossible to control the encoding of the files submitted to us, my question is: what is the best way to handle uploaded text files of different encodings? I can currently think of two solutions:

1. When a file is received, detect its encoding and convert it to UTF-8, then re-save it. The rest of the system then only needs to be UTF-8 aware and can ignore encoding issues.
2. Change the CSV / fixed-width libraries so they become encoding-aware themselves.

I also thought about the pros and cons of each:

1. Converting the input makes the rest of the libraries smaller and reduces duplication, but it seems wasteful in terms of processing.
2. Making the libraries encoding-aware internally seems to involve more code, but might be faster.

Thoughts please?

Edit: I am really interested to know where, architecturally, character encoding conversion should happen: at the point of input, or during the use of the files?

This is tricky, and there is no perfect solution.

phpMyAdmin, for example, lets the user specify the encoding of the uploaded file. Since no automatic detection method is 100% reliable, this is the best way to go if at all possible, IMO.

An import dialog that allows the user to select the right encoding while seeing a preview of what their data looks like in that encoding might be optimal.

One way to do this could be:

1. Receive the uploaded file and store it in a temporary file.
2. Display a dialog with a drop-down selection of the most important encodings.
3. Have an iframe that, when the selected value in the drop-down changes, converts the contents of the uploaded file using iconv() (source = the selected encoding; target = UTF-8) and shows a preview.
4. When the user selects an encoding, do a final iconv() and store the file as UTF-8.
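
The server side of these steps might look roughly like the sketch below. The function names, the list of offered encodings, and the ten-line preview are all illustrative assumptions, not part of any existing library. The preview reads whole lines so that a multi-byte character is never cut in half, which holds for the ASCII-compatible encodings listed here.

```php
<?php
// Hypothetical helpers for the preview/confirm flow; all names are illustrative.

/** Encodings offered in the drop-down (an example list, not exhaustive). */
const OFFERED_ENCODINGS = ['UTF-8', 'Windows-1252', 'ISO-8859-1', 'ISO-8859-15'];

/**
 * Return the first lines of the temporary upload as UTF-8, decoded as $encoding.
 * Returns null if the encoding is not offered, the file cannot be read,
 * or the bytes are not valid in that encoding.
 */
function previewUpload(string $tmpPath, string $encoding, int $maxLines = 10): ?string
{
    if (!in_array($encoding, OFFERED_ENCODINGS, true)) {
        return null; // never hand an arbitrary user-supplied name to iconv()
    }
    $handle = @fopen($tmpPath, 'rb');
    if ($handle === false) {
        return null;
    }
    $bytes = '';
    for ($i = 0; $i < $maxLines && ($line = fgets($handle)) !== false; $i++) {
        $bytes .= $line;
    }
    fclose($handle);

    $utf8 = @iconv($encoding, 'UTF-8', $bytes);
    return $utf8 === false ? null : $utf8;
}

/** Final step: convert the whole upload and store it as UTF-8. */
function storeAsUtf8(string $tmpPath, string $encoding, string $destPath): bool
{
    if (!in_array($encoding, OFFERED_ENCODINGS, true)) {
        return false;
    }
    $bytes = @file_get_contents($tmpPath);
    if ($bytes === false) {
        return false;
    }
    $utf8 = @iconv($encoding, 'UTF-8', $bytes);
    if ($utf8 === false) {
        return false;
    }
    return file_put_contents($destPath, $utf8) !== false;
}
```

A null preview is itself useful feedback: it tells the user that the selected encoding cannot be right for this file.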

Automatic encoding detection for CSV can be difficult, in my experience. It is reliable only for a small subset of encodings (such as the UTF family and a few others). In that regard, Pekka's suggestions point in the right direction by placing the burden of identifying the correct encoding on the end user.
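
As a rough illustration of the cases where detection is dependable, the sketch below only recognises a Unicode byte-order mark or content that validates as UTF-8, and otherwise gives up so that the user can be asked. The function name is hypothetical, and the code assumes the mbstring extension is available for mb_check_encoding().

```php
<?php
/**
 * Guess the encoding only where the guess is dependable: a Unicode
 * byte-order mark, or bytes that validate as UTF-8.
 * Returns null when the encoding has to be supplied by the user.
 */
function detectUnicodeEncoding(string $bytes): ?string
{
    // The 4-byte UTF-32 marks come first, because the UTF-32LE mark
    // begins with the same two bytes as the UTF-16LE mark.
    $boms = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];
    foreach ($boms as $bom => $name) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $name;
        }
    }

    // Plain ASCII also passes this check, which is harmless:
    // ASCII is a subset of UTF-8, so no conversion is needed.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }

    // Single-byte encodings such as Windows-1252 or ISO-8859-x cannot be
    // told apart reliably from the bytes alone.
    return null;
}
```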

Keeping UTF-8 as the internal format is a good idea, but I suggest keeping the charset issues separate from CSV processing, since the format itself has no rules about encoding. While it's true that decoding on the fly is somewhat more efficient, the increase in code complexity might not justify the gain. Keeping the software components specialized is always a good idea.
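
For comparison, on-the-fly decoding does not have to live inside the CSV code. One possibility is PHP's convert.iconv stream filter, which decodes as the file is read while the parser still only sees UTF-8. This is a sketch under assumptions: the function names are hypothetical, and the filter is only available when the iconv extension is loaded.

```php
<?php
/**
 * Open a file so that every read yields UTF-8, whatever the source encoding.
 *
 * @return resource|false An open stream, or false on failure.
 */
function openAsUtf8(string $path, string $sourceEncoding)
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    // The filter converts each chunk as it is read from the stream.
    $filter = @stream_filter_append(
        $handle,
        "convert.iconv.$sourceEncoding/UTF-8",
        STREAM_FILTER_READ
    );
    if ($filter === false) {
        fclose($handle);
        return false;
    }
    return $handle;
}

/** Read a CSV file in any supported encoding into rows of UTF-8 strings. */
function readCsvAsUtf8(string $path, string $sourceEncoding): array
{
    $handle = openAsUtf8($path, $sourceEncoding);
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path as $sourceEncoding");
    }
    $rows = [];
    // fgetcsv() is unaware of the conversion; it just receives UTF-8 bytes.
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $rows[] = $row;
    }
    fclose($handle);
    return $rows;
}
```

The trade-off is that a wrong encoding surfaces in the middle of processing rather than at upload time, which is one reason converting once up front can be simpler to reason about.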

Character transformations should happen inside the server-side controller, before handing control over to the CSV processor, provided the system adheres to MVC.
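
A minimal sketch of that boundary follows. Both classes are hypothetical stand-ins: CsvProcessor represents the existing library, and UploadController is not taken from any framework. The only point is that conversion happens in the controller, so the processor never sees anything but UTF-8.

```php
<?php
/** Stand-in for the existing CSV library: it only ever receives UTF-8. */
final class CsvProcessor
{
    public function parse(string $utf8): array
    {
        // Naive line splitting; quoted fields containing newlines are not handled.
        $lines = preg_split('/\r\n|\r|\n/', trim($utf8));
        return array_map(
            function (string $line): array {
                return str_getcsv($line, ',', '"', '\\');
            },
            $lines
        );
    }
}

/** Hypothetical controller: the one place that deals with encodings. */
final class UploadController
{
    /** @var CsvProcessor */
    private $processor;

    public function __construct(CsvProcessor $processor)
    {
        $this->processor = $processor;
    }

    public function import(string $rawBytes, string $declaredEncoding): array
    {
        // Convert at the point of input; fail early if the bytes do not fit.
        $utf8 = @iconv($declaredEncoding, 'UTF-8', $rawBytes);
        if ($utf8 === false) {
            throw new InvalidArgumentException(
                "Upload could not be decoded as $declaredEncoding"
            );
        }
        return $this->processor->parse($utf8);
    }
}
```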
