Non-English Characters are being converted to Decimal
While checking an RSS feed in a browser, I can see the text as below:
装,配上超短迷你裙,太过暴露,也很不得体。大专学生的随性打扮...
But in the source view, the same text appears as decimal numbers, as below:
#30701裤、迷你裙、吊带装、人字拖鞋......大
987学生的穿着打扮及潮流品味,一直都是是大家讨论的
8909门话题。&
Is this due to localization of the content, or is the file saved in a different encoding? I can see that the file is saved as UTF-8.
I am trying to parse the RSS feed using Python. But after parsing, I am only getting the decimal values, not the actual characters.
It's not that the source view is converting it to decimal - it's that the browser is handling the entities and converting them to the relevant non-ASCII characters. It's possible that it's being a little generous in terms of converting entities which don't have a terminating ';'.
The server is almost certainly serving what you're seeing in the source view.
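Since the question is about Python, here is a minimal sketch of doing the same decoding the browser does. It assumes the standard library's html.unescape is acceptable for the job; that function follows the HTML5 rules, so it also accepts numeric references that lack the terminating ';'. The helper name and the sample strings are made up for illustration.

```python
import html

def decode_entities(text):
    """Replace character references such as &#30701; with the characters they stand for."""
    return html.unescape(text)

# Well-formed references, as a feed would normally carry them.
print(decode_entities("&#30701;&#35044;"))

# A reference missing its terminating ';' is still decoded (HTML5 leniency).
print(decode_entities("&#28909门话题"))
```

Text without any references passes through unchanged, so it should be safe to apply this to every field pulled from the feed.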
For some reason, the tool that created the feed decided to write every character as its decimal Unicode code point (a numeric character reference). Odd indeed, but only the author of that tool can say why.
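One plausible way a tool ends up producing this output (an assumption, since the feed generator is unknown) is by encoding text to ASCII and replacing everything that does not fit with a numeric character reference. In Python that behaviour is available through the 'xmlcharrefreplace' error handler; the function name below is hypothetical.

```python
def to_char_refs(text):
    """Encode to ASCII, writing each non-ASCII character as &#<decimal code point>;."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

# ASCII characters are left alone; everything else becomes a decimal reference.
print(to_char_refs("大专学生 students"))
```

The number in each reference is simply the character's Unicode code point in decimal, which is why the values are the same whatever encoding the file itself is saved in.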
Aren't they just stored as HTML entities by the author of the page?
http://tlt.its.psu.edu/suggestions/international/bylanguage/thaichart.html
This is how the browser handles this. Write a simple HTML page, put these 'decimal' values in it, and check what you get.
Yes, you can use UTF-8 characters directly in HTML, but then you must declare the page encoding. Encoding non-ASCII characters as decimal references, as in your example, is simply safer, so many pages prefer to do so. It is specified in the HTML standards, so if you wish to parse HTML manually, you must be able to deal with it.
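If the goal is to parse the feed rather than handle raw strings, an XML parser resolves well-formed numeric character references on its own. Below is a minimal sketch using the standard library's xml.etree.ElementTree. The feed fragment and the function name are invented for illustration, and it is only a guess that the real feed may be double-escaped (written as &amp;#28909;), in which case the parser yields the literal text &#28909; and a second pass with html.unescape is needed. Note that a strict XML parser rejects references that lack the terminating ';'.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical feed fragment: the first title uses plain references,
# the second has them double-escaped.
FEED = (
    "<rss><channel>"
    "<item><title>&#28909;&#38376; topic</title></item>"
    "<item><title>&amp;#28909;&amp;#38376; topic</title></item>"
    "</channel></rss>"
)

def item_titles(xml_text):
    """Return the item titles with any leftover character references decoded."""
    root = ET.fromstring(xml_text)
    return [
        html.unescape(item.findtext("title", default=""))
        for item in root.iter("item")
    ]

for title in item_titles(FEED):
    print(title)
```

Both titles come out as the same readable text. The extra html.unescape call would also decode a title that was meant to contain a literal reference, so it is worth dropping if the feed turns out not to be double-escaped.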