Help Replacing Non-ASCII character in Python

I have a bunch of HTML files that I downloaded using the httplib2 package in Python. The non-breaking spaces ('&nbsp;') are showing up as 'Â '.

<font color="#ff0000">02/12/2004Â </font> is showing while <font color="#ff0000">02/12/2004&nbsp;</font> is the desired format.

How do I replace the 'Â ' with '&nbsp;' in Python? Thanks a lot!


You've got an encoding problem. Instead of trying to remove these characters, find out the encoding of the page. Then, when you read the file, use the codecs module instead of plain open() and pass the proper character encoding.
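A minimal sketch of that approach, assuming the pages are UTF-8 (check the page's Content-Type header or meta tag to be sure). It uses io.open, which takes the same encoding argument as codecs.open and is the built-in open() in Python 3. The helper names read_html and nbsp_to_entity are made up for illustration.

```python
import io


def read_html(path, encoding="utf-8"):
    """Read an HTML file as text, decoding its bytes with the given encoding."""
    # The encoding default is an assumption; pass the page's real encoding.
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


def nbsp_to_entity(text):
    """Replace each real non-breaking space (U+00A0) with the &nbsp; entity."""
    return text.replace(u"\u00a0", u"&nbsp;")
```

Once the file is decoded correctly, the 'Â' never appears, and the only replacement left to make is the non-breaking space itself.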


import string
filtered_content = ''.join(c for c in content if c in string.printable)  # drops every non-ASCII character

This solved my problem. Thank you!


# str.replace returns a new string, so keep the result
s = s.replace('Â ', '&nbsp;')

However, although I haven't used httplib2, I'm pretty sure something is wrong if the source of the HTML files changes when you download them. It is more likely a decoding problem. What version of Python are you using? If it's Python 3, the response content is a bytes object, not a string, so you'll have to specify the right encoding when you decode the bytes.
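To illustrate where the 'Â' likely comes from: the UTF-8 encoding of a non-breaking space (U+00A0) is the two bytes C2 A0, and decoding those bytes as Latin-1 turns C2 into 'Â'. The sketch below assumes the pages are UTF-8; the sample bytes and the function names are hypothetical.

```python
# Sample bytes as a server might send them: the date followed by a UTF-8 nbsp.
raw = b"02/12/2004\xc2\xa0"

wrong = raw.decode("latin-1")  # one character per byte: 'Â' plus U+00A0
right = raw.decode("utf-8")    # a single U+00A0 after the date


def decode_html(raw_bytes, encoding="utf-8"):
    """Decode downloaded bytes, then write nbsp as the &nbsp; entity."""
    return raw_bytes.decode(encoding).replace(u"\u00a0", u"&nbsp;")


def repair_mojibake(text):
    """Undo a wrong Latin-1 decode of UTF-8 bytes in already-saved text."""
    # Only works when every character in the text fits in Latin-1.
    return text.encode("latin-1").decode("utf-8")
```

decode_html is for content that is still bytes. repair_mojibake is a fallback for files that were already saved with the wrong decoding.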

http://code.google.com/p/httplib2/wiki/ExamplesPython3

EDIT: If you aren't limited to using just httplib2, perhaps you could try looking into using the urllib, urllib2, or httplib modules that are part of the Python 2.6 standard library?
