开发者

How do browsers/PHP handle characters outside the set characterset?

I'm looking into how characters are handled that are outside of the set characterset for a pag开发者_如何学运维e.

In this case the page is set to iso-8859-1, and the previous programmer decided to escape input using htmlentities($string,ENT_COMPAT). This is then stored into Latin1 tables in Mysql.

As the table is set to the same character set as the page, I am wondering if that htmlentities step is needed. I did some experiments on http://floris.workingweb.nl/experiments/characters.php and it seems that for stuff inside Latin1 some characters are escaped, but for example with a Czech name they are not.

Is this because those characters are outside of Latin1? If so, then the htmlentities can be removed, as it doesn't help for stuff outside of Latin1 anyway, and for within Latin1 it is not needed as far as I can see now...


htmlentities only translates characters it knows about (get_html_translation_table(HTML_ENTITIES) returns the whole list), and leaves the rest as is. So you're right, using it for non-latin data makes no sense. Moreover, both html-encoding of database entries and using latin1 are bad ideas either, and I'd suggest to get rid of them both.

A word of warning: after removing htmlentities(), remember that you still need to escape quotes for the data you're going to insert in DB (mysql_escape_string or similar).


He could have used it as a basic safety precaution, ie. to prevent users from inserting HTML/Javascript into the input (because < and > will be escaped as well).

btw If you want to support Eastern and Western European languages I would suggest using UTF-8 as the default character encoding.


Yes
though not because Czech characters are outside of Latin1 but because they share the same places in the table. So, database take it as corresponding latin1 characters.

using htmlentities is always bad. the only proper solution to store different languages is to use UTF-8 charset.


Take note that htmlentities / htmlspecialchars have a 3rd parameter (since PHP 4.1.0) for the charset. ISO-8859-1 is the default so if you apply htmlentities without a 3rd parameter to a UTF-8 string for example, the output will be corrupted.

You can detect & convert the input string with mb_detect_encoding and mb_convert_encoding to make sure the input string match the desired charset.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜