
How to open an HTML page with windows-1252 encoding in BeautifulSoup

I am trying to parse an HTML document with BeautifulSoup, but I am running into trouble. What is the best way to open an HTML document with windows-1252 encoding?

I tried converting the file to UTF-8 with iconv, but that doesn't work either.

doc = open("e.html").read()

soup = BeautifulSoup(doc)

soup.findAll('p')

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 103: ordinal not in range(128)

When I open it without iconv I get the same error.

Full traceback:

>>> soup.findAll('p')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 103: ordinal not in range(128)


I was getting a similar error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 723617: invalid continuation byte

What worked for me was to specify the input encoding like so:

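from bs4 import BeautifulSoup

# Decode the file as windows-1252 while reading it, and close it afterwards
with open("page.html", encoding="windows-1252") as page:
    soup = BeautifulSoup(page.read(), "html.parser")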
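
To see why naming the encoding matters, here is a minimal sketch that uses only the standard library. The sample string and the temporary file are made up for illustration; the only part taken from the answer is open() with its encoding argument.

```python
import os
import tempfile

# Hypothetical sample document: "\u00fc" (u-umlaut) is the single byte 0xFC in windows-1252.
sample = "<p>M\u00fcnchen</p>"

# Write the sample to a temporary file as windows-1252 bytes.
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(sample.encode("windows-1252"))
    path = tmp.name

try:
    # Reading with the wrong codec fails, because 0xFC is not valid UTF-8.
    try:
        with open(path, encoding="utf-8") as page:
            page.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Reading with the matching codec returns the original text.
    with open(path, encoding="windows-1252") as page:
        html = page.read()
finally:
    os.remove(path)

print(utf8_failed)
print(html == sample)
```

The decoded string in html is what you would then hand to BeautifulSoup.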


Try something like this:

# Read the raw bytes, then decode them explicitly as cp1252 (windows-1252)
doc = open("e.html", "rb").read()
doc = doc.decode('cp1252')

soup = BeautifulSoup(doc)
soup.findAll('p')
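
If you don't know in advance whether a file is UTF-8 or windows-1252, the same decode step can be wrapped in a small helper. This is only a sketch for Python 3: decode_html is a hypothetical name, and trying UTF-8 first before falling back is my assumption, not something from the question.

```python
def decode_html(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode raw HTML bytes, trying UTF-8 first and then the fallback codec."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values undefined; errors="replace"
        # substitutes U+FFFD for those instead of raising again.
        return raw.decode(fallback, errors="replace")


# Hypothetical input: "Gr\u00fc\u00dfe" stored as windows-1252 bytes.
raw = "<p>Gr\u00fc\u00dfe</p>".encode("cp1252")
text = decode_html(raw)
print(text == "<p>Gr\u00fc\u00dfe</p>")
```

The returned str can be passed to BeautifulSoup in place of doc.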