How do browsers/PHP handle characters outside the set characterset?

2022-12-25 06:20 问答作者：

I'm looking into how characters are handled that are outside of the set characterset for a pag开发者_如何学运维e.

In this case the page is set to iso-8859-1, and the previous programmer decided to escape input using htmlentities($string,ENT_COMPAT). This is then stored into Latin1 tables in Mysql.

As the table is set to the same character set as the page, I am wondering if that htmlentities step is needed. I did some experiments on http://floris.workingweb.nl/experiments/characters.php and it seems that for stuff inside Latin1 some characters are escaped, but for example with a Czech name they are not.

Is this because those characters are outside of Latin1? If so, then the htmlentities can be removed, as it doesn't help for stuff outside of Latin1 anyway, and for within Latin1 it is not needed as far as I can see now...

htmlentities only translates characters it knows about (get_html_translation_table(HTML_ENTITIES) returns the whole list), and leaves the rest as is. So you're right, using it for non-latin data makes no sense. Moreover, both html-encoding of database entries and using latin1 are bad ideas either, and I'd suggest to get rid of them both.

A word of warning: after removing htmlentities(), remember that you still need to escape quotes for the data you're going to insert in DB (mysql_escape_string or similar).

He could have used it as a basic safety precaution, ie. to prevent users from inserting HTML/Javascript into the input (because < and > will be escaped as well).

btw If you want to support Eastern and Western European languages I would suggest using UTF-8 as the default character encoding.

Yes
though not because Czech characters are outside of Latin1 but because they share the same places in the table. So, database take it as corresponding latin1 characters.

using htmlentities is always bad. the only proper solution to store different languages is to use UTF-8 charset.

Take note that htmlentities / htmlspecialchars have a 3rd parameter (since PHP 4.1.0) for the charset. ISO-8859-1 is the default so if you apply htmlentities without a 3rd parameter to a UTF-8 string for example, the output will be corrupted.

You can detect & convert the input string with mb_detect_encoding and mb_convert_encoding to make sure the input string match the desired charset.

继续阅读：character-encoding php

How do browsers/PHP handle characters outside the set characterset?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？