
Python encoding issue with BeautifulSoup

Hello, I have a problem with encoding.

When I pass the page content to BeautifulSoup, all national characters are lost.

addr = "http://zjazdowa.com.pl/index.php/aktualne-ceny-warzyw-i-owocow-.html"
content = urllib2.urlopen(addr).read()
html_pag = BeautifulSoup(content)  # <- here I lose all national letters
table_html = html_pag.find("div", id="808")

At the top of the file I have:

#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup
import urllib2, string, re, sys
reload(sys)
sys.setdefaultencoding("utf-8")


According to the BeautifulSoup documentation, all input is converted to Unicode internally:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("Hello")
soup.contents[0]
# u'Hello'
soup.originalEncoding
# 'ascii'

If your input does not specify its encoding (e.g. via meta tags), BeautifulSoup guesses. You can disable the guessing by specifying the encoding of the input via the fromEncoding parameter to BeautifulSoup:

soup = BeautifulSoup("hello", fromEncoding="UTF-8")
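If you would rather decode the bytes yourself before handing them to a parser, the idea can be sketched roughly as follows. This is a minimal Python 3 sketch using only the standard library, not BeautifulSoup's own detection logic; `decode_html`, its parameters, and the regular expression are hypothetical names made up for illustration.

```python
import re

# Hypothetical helper, not part of BeautifulSoup: looks for a charset
# declaration inside a <meta ...> tag near the start of the document.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def decode_html(raw, declared=None, fallback="utf-8"):
    """Decode raw HTML bytes; return (text, name of the encoding used).

    Candidates are tried in order: the caller-supplied encoding, the
    encoding named in a meta tag (if any), then the fallback.
    """
    candidates = []
    if declared:
        candidates.append(declared)
    match = _META_CHARSET.search(raw[:2048])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.append(fallback)

    for name in candidates:
        try:
            return raw.decode(name), name
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or bytes invalid for it: try the next one.
            continue

    # Last resort: keep going, but mark the bytes that could not be decoded.
    return raw.decode(fallback, errors="replace"), fallback
```

Calling `decode_html(content)` returns the decoded text together with the encoding that worked, so a wrong guess shows up immediately. In the newer bs4 package the equivalent of fromEncoding is spelled from_encoding.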

Or is your real problem the 'broken' output of the result on the console?
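To check whether the console is the culprit, you can re-encode the text for the terminal explicitly. The snippet below is a small Python 3 sketch; `console_safe` is a hypothetical helper, and it assumes that replacing unprintable characters with '?' is acceptable for debugging.

```python
import sys


def console_safe(text, encoding=None):
    """Return text limited to what the target encoding can represent.

    Characters the encoding cannot express are replaced with '?', so
    printing never raises UnicodeEncodeError.
    """
    target = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(target, errors="replace").decode(target)


# The string itself is intact; only its console rendering may differ.
print(console_safe(u"Kapusta w\u0142oska"))
```

If the parsed string prints with '?' here but compares equal to the expected Unicode text, the data is intact and only the terminal encoding is at fault.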


And your code works perfectly fine:

>>> addr = "http://zjazdowa.com.pl/index.php/aktualne-ceny-warzyw-i-owocow-.html"
>>> content = urllib2.urlopen(addr).read()
>>> html_pag = BeautifulSoup(content)  # <- here I lose all national letters
>>> table_html = html_pag.find("div", id="808")
>>> print table_html.findAll('td')[8].string
Kapusta włoska

A few notes on this:

#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup
import urllib2, string, re, sys
reload(sys)
sys.setdefaultencoding("utf-8")

reload reloads a module. I'm not sure what you're hoping to do by reloading sys, but it isn't buying you anything.
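An alternative to changing the interpreter-wide default is to name the encoding wherever text crosses a boundary. The following is a minimal sketch of that idea, assuming the scraped text ends up in a file; `save_text`, `load_text`, and the file name `prices.txt` are hypothetical.

```python
import io
import os
import tempfile


def save_text(path, text, encoding="utf-8"):
    """Write Unicode text to path using an explicitly named encoding."""
    with io.open(path, "w", encoding=encoding) as handle:
        handle.write(text)


def load_text(path, encoding="utf-8"):
    """Read the file back as Unicode text using the same encoding."""
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Hypothetical output location used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "prices.txt")
save_text(path, u"Kapusta w\u0142oska")
print(load_text(path) == u"Kapusta w\u0142oska")
```

Because the encoding is stated at each read and write, the result does not depend on the process-wide default encoding.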
