Should I convert overlong UTF-8 strings to their shortest normal form?

2022-12-28 13:27 问答作者：

I've just been reworking my Encoding::FixLatin Perl module to handle overlong UTF-8 byte sequences and convert them to the shortest normal form.

My question is quite simply "is this a bad idea"?

A number of sources (including this RFC) suggest that any over-long UTF-8 should be treated as an error and rejected. They caution against "naive implementations" and leave me with the impression that these things are inherently unsafe.

Since the whole purpose of my module is to clean up messy data files with mixed encodings and convert them to nice clean utf8, this seems like just one more thing I can clean up so the application layer doesn't have to deal with it. My code does not concern itself with any semantic meaning the resulting characters might have, it开发者_C百科 simply converts them into a normalised form.

Am I missing something. Is there a hidden danger I haven't considered?

Yes, this is a bad idea.

Maybe some of the data in one of these messy data files was checked to see that it didn't contain a dangerous sequence of ASCII characters.

The canonical example that caused many problems: '\xC0\xBCscript>'. ‘Fix’ the overlong sequence to plain ASCII < and you have accidentally created a security hole.

No tool has ever generated overlongs for any legitimate purpose. If you're trying to repair mixed encoding files, you should consider encountering one as a sign that you have mis-guessed the encoding.

I don't think this is a bad idea from a security or usability perspective.

From security perspective you should be sanitizing user input before use. So you can run your clean up routines, and then make sure the data doesn't contain greater-than/less-than symbols <> before it is printed out. You should also make sure you call mysql_real_escape_string() before inserting it into the database. Keep in mind that language encoding issues such as GBK vs Latin1 can lead to sql injection when you aren't using mysql_real_escape_string(). (This function name should be pretty similar regardless of your platform specific mysql library bindings)

Sanitizing all user input is generally a terrible idea because you don't know how the specific variable will be used. For instance sql injection and xss have very different control characters involved and the same sensitization for both often leads to vulnerabilities.

I don't know if it is a bad idea in your scenario, however, as this kind of change is not bijective, it may lead to data loss.

If you incorrectly detected the encoding of your data, you may interpret data as being legitimate UTF-8 overlongs and change them in the shortest normal form. There will be no way to later retrieve the original data.

As a personal experience, I know that when such things may happen, they WILL and you will potentially not notice the error before it is too late...

继续阅读：encoding perl security utf-8

Should I convert overlong UTF-8 strings to their shortest normal form?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？