
® gets converted to Â® in Python while parsing XML

My RSS feed contains:

<title><![CDATA[HBO Wins 19 Emmy® Awards, The Most of Any Network This Year]]></title>

I am parsing the RSS feed and assigning each item's title to title, as below:

    for item in XML.ElementFromURL(feed).xpath('//item', namespaces=NEWS_NS):
        title = item.find('title').text
        Log("Title :" + title)

When I check the output or the log file, I see the title as below:

HBO Wins 19 EmmyÂ® Awards, The Most of Any Network This Year.

® gets converted to Â®. I also tried using an HTML parser, but it did not help.


You state that the encoding of the feed is ISO-8859-1.

In that case, if the bytes that you say should be interpreted as ® are in fact C2 AE, then the text really, truly is EmmyÂ® Awards, and everything is working as it should. If the sender intended different text, they would have sent different data or set the encoding differently.

If the encoding of the feed were UTF-8, and the bytes sent over the wire were still C2 AE, then the text would be Emmy® Awards.

If the encoding of the feed were ISO-8859-1, and the bytes sent over the wire were simply AE, with no C2, then the text would be Emmy® Awards.
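
The three cases above can be checked directly in Python. This is only a minimal sketch, assuming Python 3; the variable names are illustrative and do not come from the original code.

```python
# Bytes as they might appear on the wire for the registered sign.
two_bytes = b"Emmy\xc2\xae Awards"  # C2 AE
one_byte = b"Emmy\xae Awards"       # AE only, no C2

# Case 1: C2 AE read as ISO-8859-1 yields two characters, U+00C2 and U+00AE.
as_latin1 = two_bytes.decode("iso-8859-1")

# Case 2: C2 AE read as UTF-8 yields the single character U+00AE.
as_utf8 = two_bytes.decode("utf-8")

# Case 3: AE alone read as ISO-8859-1 also yields U+00AE.
single_as_latin1 = one_byte.decode("iso-8859-1")

# ascii() escapes non-ASCII characters, so the result does not depend
# on how the terminal itself is configured.
for label, text in [("case 1", as_latin1), ("case 2", as_utf8), ("case 3", single_as_latin1)]:
    print(label, ascii(text), len(text))
```

Cases 2 and 3 produce the same text from different bytes, which is why the declared encoding matters as much as the bytes themselves.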

To be sure what the bytes are, use the od -x command in Unix or the d command in debug.exe for Windows. Don't trust Notepad in situations like this. It lies.
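
If neither tool is at hand, the bytes can also be inspected from Python. The sketch below assumes Python 3; hex_dump is a hypothetical helper name, and the sample bytes stand in for the real feed, which would need to be read in binary mode (for example with open(path, "rb")).

```python
def hex_dump(data: bytes) -> str:
    """Return the bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)


# Stand-in for the raw feed content (an assumption, not the real feed).
sample = "Emmy\u00ae Awards".encode("utf-8")

# A UTF-8 encoded registered sign shows up as "c2 ae" in the dump;
# an ISO-8859-1 encoded one would show up as "ae" alone.
print(hex_dump(sample))
```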


You've received some text encoded using UTF-8, but at some point those bytes are being incorrectly interpreted as ISO-8859-1 or another encoding instead.

Without more context, it's difficult to tell exactly where the mistake is taking place. You should first check the encoding being used to read your log file.


I tried the following and it worked:

title = item.find('title').text
title = title.encode('iso-8859-1')

The string I get has been converted as if to UTF-8 (® to Â®), so I convert it back to ISO-8859-1 (Â® to ®) and get the correct output.
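
For reference, the same round trip can be sketched in Python 3, where encode returns bytes and an explicit decode step is needed to get text back. This assumes the title really was UTF-8 data read as ISO-8859-1; repair_mojibake is a hypothetical helper name, not part of any library.

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-ISO-8859-1 mix-up.

    Returns the text unchanged if it cannot be re-encoded as ISO-8859-1
    or if the resulting bytes are not valid UTF-8.
    """
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# The garbled title as it appears in the log (U+00C2 followed by U+00AE).
garbled = "HBO Wins 19 Emmy\u00c2\u00ae Awards"
fixed = repair_mojibake(garbled)
print(ascii(fixed))
```

This only papers over the symptom. If the feed declares the wrong encoding, the more durable fix is to decode the raw bytes with the correct encoding before parsing.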
