Get content-type from HTML page with BeautifulSoup

2023-02-19 12:58 问答作者：

I am trying to get the character encoding for pages that I scrape, but in some cases it is failing. Here is what I am doing:

resp = urllib2.urlopen(request)
self.COOKIE_JAR.extract_cookies(resp, req开发者_JAVA技巧uest)
content = resp.read()
encodeType= resp.headers.getparam('charset')
resp.close()

That is my first attempt. But if charset comes back as type None, I do this:

soup = BeautifulSoup(html)
if encodeType == None:
    try:
        encodeType = soup.findAll('meta', {'http-equiv':lambda v:v.lower()=='content-type'})
    except AttributeError, e:
        print e
        try:
            encodeType = soup.findAll('meta', {'charset':lambda v:v.lower() != None})
        except AttributeError, e:
            print e
            if encodeType == '':
                encodeType = 'iso-8859-1'

The page I am testing has this in the header: <meta charset="ISO-8859-1">

I would expect the first try statement to return an empty string, but I get this error on both try statements (which is why the 2nd statement is nested for now):

'NoneType' object has no attribute 'lower'

What is wrong with the 2nd try statement? I am guessing the 1st one is incorrect as well, since it's throwing an error and not just coming back blank.

OR better yet is there a more elegant way to just remove any special character encoding from a page? My end result I am trying to accomplish is that I don't care about any of the specially encoded characters. I want to delete encoded characters and keep the raw text. Can I skip all of the above an tell BeautifulSoup to just strip anything that is encoded?

I decided to just go with whatever BeautifulSoup spits out. Then as I parse through each word in the document, if I can't convert it to a string, I just disregard it.

for word in doc.lower().split(): 
        try:
            word = str(word)
            word = self.handlePunctuation(word)
            if word == False:
                continue
        except UnicodeEncodeError, e:
            #word couldn't be converted to string; most likely encoding garbage we can toss anyways
            continue

When attempting to determine the character encoding of a page, I believe the order that should be tried is:

Determine from the HTML page itself via meta tags (e.g. <meta http-equiv="Content-Type" content="text/html; charset=utf-8">)
Determine encoding via the HTTP headers (e.g. Content-Type: text/html; charset=ISO-8859-1)
Finally, if the above don't yield anything, you can do something like use an algorithm to determine the character encoding of a page using the distribution of bytes within it (note that isn't guaranteed to find the right encoding). Check out the chardet library for this option.

继续阅读：lambda python

Get content-type from HTML page with BeautifulSoup

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？