Python encoding issue with BeautifulSoup
Hello, I've got a problem with encoding.
When I pass a string to BeautifulSoup, I lose all the national characters:
addr = "http://zjazdowa.com.pl/index.php/aktualne-ceny-warzyw-i-owocow-.html"
content = urllib2.urlopen(addr).read()
html_pag = BeautifulSoup(content)  # <- here I lose all the national letters
table_html = html_pag.find("div", id="808")
At the top of the file I have:
#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup
import urllib2, string, re, sys
reload(sys)
sys.setdefaultencoding("utf-8")
According to the BeautifulSoup documentation, all input is converted to Unicode internally:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("Hello")
soup.contents[0]
# u'Hello'
soup.originalEncoding
# 'ascii'
If your input does not specify its encoding (e.g. via meta tags), BeautifulSoup guesses. You can disable the guessing by specifying the encoding of the input via the fromEncoding parameter to BeautifulSoup:
soup = BeautifulSoup("hello", fromEncoding="UTF-8")
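To see why naming the encoding matters, here is a minimal stand-alone sketch using only plain byte decoding, without BeautifulSoup or network access. The helper decode_page and the sample bytes are hypothetical, made up for illustration; the sample assumes the page is served as UTF-8. It only shows what a wrong guess does to national characters, not how BeautifulSoup does its guessing.

```python
# -*- coding: utf-8 -*-
# Illustration only: plain byte decoding, no BeautifulSoup involved.

def decode_page(raw_bytes, encoding):
    """Decode raw page bytes into a unicode string using an explicit encoding."""
    return raw_bytes.decode(encoding)

# Hypothetical sample standing in for the bytes returned by urlopen(addr).read();
# assumed to be UTF-8 encoded.
sample = u"Kapusta włoska".encode("utf-8")

right = decode_page(sample, "utf-8")       # matches how the bytes were encoded
wrong = decode_page(sample, "iso-8859-2")  # wrong guess: no exception, garbled text

print(right == u"Kapusta włoska")  # True
print(wrong == right)              # False
```

The wrong guess raises no exception, which is why this kind of corruption is easy to miss until the text is displayed.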
Or is your real problem the 'broken' output of the result on the console?
And your code works perfectly fine:
>>> addr = "http://zjazdowa.com.pl/index.php/aktualne-ceny-warzyw-i-owocow-.html"
>>> content = urllib2.urlopen(addr).read()
>>> html_pag = BeautifulSoup(content)  # <- here I lose all the national letters
>>> table_html = html_pag.find("div", id="808")
>>> print table_html.findAll('td')[8].string
Kapusta włoska
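If the parsed text is correct but the printed output is not, the console encoding is the likely suspect. The sketch below is an assumption-laden illustration, not part of the original code: for_console is a hypothetical helper that encodes a unicode string for a given output encoding and replaces whatever that encoding cannot represent.

```python
# -*- coding: utf-8 -*-
import sys

def for_console(text, encoding=None):
    """Encode unicode text for output, replacing characters the target cannot represent."""
    # Fall back to ASCII when the stream reports no encoding (e.g. piped output).
    target = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(target, "replace")

cell = u"Kapusta włoska"  # the value printed in the session above

as_utf8 = for_console(cell, "utf-8")   # every character survives
as_ascii = for_console(cell, "ascii")  # the national letter becomes "?"

print(as_utf8.decode("utf-8") == cell)  # True
print(as_ascii == b"Kapusta w?oska")    # True
```

If your output looks like the ASCII case, the string inside BeautifulSoup is probably fine and only the terminal or output stream is losing the characters.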
A few notes on this:
#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup
import urllib2, string, re, sys
reload(sys)
sys.setdefaultencoding("utf-8")
reload reloads a module. I'm not sure what you're hoping to do by reloading sys, but it isn't buying you anything.