HTML Encoding characters not in the character set
We have a web app which uses the ISO-8859-1 character set. Occasionally users have 'strange' names which contain characters like Š (HTML-encoded here for your convenience). We store this in our database, but we can't display it correctly.
What is the best way of dealing with this? I'm thinking I should automatically convert characters outside the character set to their HTML entity number encoding (Š to &#352;).
But I'm having problems finding out how to do this automatically (without using a table of all values).
This code works for extended ASCII characters like 'å' (that are present in ISO-8859-1). I would like to do the same with other characters. Is there a pattern in these HTML entity encoding values I can use?
unsigned int c;
for (int i = 0; i < html.GetLength(); i++)
{
    c = html[i];
    if (c > 255 || c < 0)
    {
        CString orig = CString(html[i]);
        CString encoded = "&#";
        encoded += CTool::String((byte)c);
        encoded += ";";
        html.Replace(orig, encoded);
    }
}
The web page should instruct the browser to display the response in UTF-8. This is usually done by supplying the charset in the Content-Type response header, like text/html;charset=UTF-8.
Response.AppendHeader("Content-Type", "text/html;charset=UTF-8");
The HTML/XML entities exist solely so that you can save the web page source in an encoding other than UTF-8.
html appears to be a "Unicode" CString, which means it is UTF-16 encoded. The "&#ddd;" syntax uses the Unicode code point number. Usually this is quite simple: Š is U+0160, which means it's 0x0160 in UTF-16. That's 352 in decimal, so you get &#352;.
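As a rough sketch of that rule for BMP-only text, something like the following might work. It uses std::u16string as a portable stand-in for the Unicode CString, and EncodeForLatin1 is a made-up name for illustration, not an MFC or library function. The key difference from the question's code is that the full 16-bit value is written out rather than a value cast to a byte.

```cpp
#include <string>

// Hypothetical helper: replaces every UTF-16 code unit above 0xFF (the last
// ISO-8859-1 value) with a decimal numeric character reference such as &#352;.
// Assumption: the input holds only BMP characters (no surrogate pairs).
std::string EncodeForLatin1(const std::u16string& text)
{
    std::string out;
    for (char16_t unit : text)
    {
        if (unit <= 0xFF)
        {
            // Representable in ISO-8859-1: the byte value equals the code point.
            out += static_cast<char>(unit);
        }
        else
        {
            // Use the whole 16-bit value; casting to a byte would truncate it.
            out += "&#" + std::to_string(static_cast<unsigned>(unit)) + ";";
        }
    }
    return out;
}
```

Building a separate output string also avoids calling Replace on the string being iterated, which changes its length mid-loop and rewrites every occurrence of the character at once.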
You only have a problem when you encounter a character outside the Basic Multilingual Plane (BMP), that is, past U+FFFF. Such a character no longer fits in 16 bits and therefore takes TWO characters (a surrogate pair) in your html string. Yet it should produce only one &#ddddd; value. This is so rare that you can often ignore it.
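If you do want to cover that case, here is a minimal sketch of combining a surrogate pair into one code point before writing the reference. Again std::u16string stands in for the CString, and EncodeWithSurrogates is a hypothetical name. Emitting U+FFFD for an unpaired surrogate is my own choice here, not something the HTML syntax requires you to do.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: like a plain per-unit encoder, but a high/low surrogate
// pair is merged into a single code point so it yields one &#ddddd; reference.
std::string EncodeWithSurrogates(const std::u16string& text)
{
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        unsigned long cp = text[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < text.size()
            && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF)
        {
            // Valid pair: 10 bits from each half, offset by 0x10000.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (text[i + 1] - 0xDC00);
            ++i; // the low surrogate has been consumed
        }
        else if (cp >= 0xD800 && cp <= 0xDFFF)
        {
            // Assumption: an unpaired surrogate becomes U+FFFD (replacement character).
            cp = 0xFFFD;
        }

        if (cp <= 0xFF)
            out += static_cast<char>(cp);            // fits in ISO-8859-1
        else
            out += "&#" + std::to_string(cp) + ";";  // decimal code point
    }
    return out;
}
```

For example, U+1F600 is stored as the two units 0xD83D 0xDE00 and comes out as the single reference &#128512;.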