How to open an HTML page with windows-1252 encoding in BeautifulSoup
I am trying to parse an HTML document with BeautifulSoup, but I am running into trouble. What is the best way to open an HTML document that uses windows-1252 encoding?
I tried converting it to UTF-8 with iconv, but that doesn't work either.
doc = open("e.html").read()
soup = BeautifulSoup(doc)
soup.findAll('p')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 103: ordinal not in range(128)
When I open it without iconv I get the same error.
Full traceback:
>>> soup.findAll('p')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 103: ordinal not in range(128)
I was getting a similar error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 723617: invalid continuation byte
What worked for me was to specify the input encoding like so:
page = open("page.html", encoding="windows-1252")
soup = BeautifulSoup(page.read(), "html.parser")
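To see why the explicit encoding matters, here is a minimal sketch using only the standard library. The sample markup and the temporary file are made up for illustration; they are not from the original page. It writes a small windows-1252 file, shows that reading it as UTF-8 fails, and then reads it with the correct codec.

```python
import os
import tempfile

# Hypothetical sample: "für" contains ü, which windows-1252 stores as the single byte 0xFC.
html = "<p>für</p>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")

    # Write the file as windows-1252 bytes, like the page in question.
    with open(path, "wb") as f:
        f.write(html.encode("windows-1252"))

    # Reading with the wrong codec raises UnicodeDecodeError,
    # because 0xFC is not a valid byte in UTF-8.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Reading with the matching codec gives back the original text.
    with open(path, encoding="windows-1252") as f:
        text = f.read()
```

The resulting `text` is an ordinary Python 3 string, so it can be handed to `BeautifulSoup(text, "html.parser")` as in the answer above. Using `with` also closes the file once reading is done.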
Try something like this (Python 2, where read() returns a byte string that has a decode() method):
doc = open("e.html").read()
doc = doc.decode('cp1252')
soup = BeautifulSoup(doc)
soup.findAll('p')
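In Python 3 a string has no decode() method, so the same idea needs the raw bytes first. A rough sketch, assuming the bytes come from a file opened in binary mode; the `raw` value below is a made-up stand-in for `open("e.html", "rb").read()`:

```python
# Hypothetical stand-in for the bytes returned by open("e.html", "rb").read().
raw = b"<p>f\xfcr</p>"

# cp1252 is Python's alias for windows-1252; decoding turns bytes into str.
doc = raw.decode("cp1252")
```

The decoded `doc` can then be passed to BeautifulSoup exactly as in the snippet above.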