How to get `http-equiv` values in Python?

I am using urllib2.urlopen to fetch a URL and read header information such as 'charset' and 'content-length'.
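Roughly like this (a minimal Python 2 sketch; example.com stands in for the real URL):

import urllib2

f = urllib2.urlopen("http://example.com")
info = f.info()  # a mimetools.Message holding the response headers
print(info.getheader('Content-Type'))    # e.g. 'text/html; charset=UTF-8'
print(info.getheader('Content-Length'))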

But some pages declare their charset in the markup instead, with something like

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

And urllib2 doesn't parse this for me.

Is there any built-in tool I can use to get http-equiv information?
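For reference, the standard library's HTMLParser can already do this without third-party packages, at the cost of more code. A minimal Python 2 sketch (the MetaEquivParser name is illustrative; this parser is strict and may raise HTMLParseError on badly broken markup):

from HTMLParser import HTMLParser  # Python 2 stdlib
import urllib2

class MetaEquivParser(HTMLParser):
    """Collects http-equiv meta headers into a dict (illustrative name)."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.http_equiv = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us;
        # <meta ... /> start-end tags are routed here as well
        if tag == 'meta':
            d = dict(attrs)
            if 'http-equiv' in d and 'content' in d:
                self.http_equiv[d['http-equiv'].lower()] = d['content']

parser = MetaEquivParser()
parser.feed(urllib2.urlopen("http://example.com").read())
print(parser.http_equiv.get('content-type'))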

EDIT:

This is what I do to parse the charset from a page (wrapped in a function here for readability):

import lxml.html

def get_meta_charset(page_source):
    elem = lxml.html.fromstring(page_source)
    content_type = elem.xpath(
        ".//meta[translate(@http-equiv, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')='content-type']/@content")
    if content_type:
        content_type = content_type[0]
        for frag in content_type.split(';'):
            frag = frag.strip().lower()
            i = frag.find('charset=')
            if i > -1:
                return frag[i+8:]  # 8 == len('charset=')
    return None

How can I improve this? Can I precompile the xpath query?
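On the precompiling question: yes, `lxml.etree.XPath` compiles an expression once so the compiled object can be reused across documents. A minimal sketch (`_CONTENT_TYPE_XPATH` and `charset_from_meta` are illustrative names):

from lxml import etree, html

# Compiled once; translate() lower-cases the attribute value so the
# match stays case-insensitive.
_CONTENT_TYPE_XPATH = etree.XPath(
    ".//meta[translate(@http-equiv, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')='content-type']/@content")

def charset_from_meta(page_source):
    for content in _CONTENT_TYPE_XPATH(html.fromstring(page_source)):
        for frag in content.split(';'):
            frag = frag.strip().lower()
            if frag.startswith('charset='):
                return frag[len('charset='):]
    return None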


Find 'http-equiv' using BeautifulSoup

import urllib2
from BeautifulSoup import BeautifulSoup

f = urllib2.urlopen("http://example.com")
soup = BeautifulSoup(f) # trust BeautifulSoup to parse the encoding
for meta in soup.findAll('meta', attrs={
    'http-equiv': lambda x: x and x.lower() == 'content-type'}):
    print("content-type: %r" % meta['content'])
    break
else:
    print('no content-type found')

#NOTE: strings in the soup are Unicode, but we can ask about charset
#      declared in the html 
print("encoding: %s" % (soup.declaredHTMLEncoding,))


Yeah, any HTML parsing library would help.

BeautifulSoup is a pure-Python library based on sgmllib; lxml is a more efficient alternative written in C.

Try either one. They will solve your problem.


I need to parse this as well (among other things) for my online http fetcher. I use lxml to parse pages and get the meta equiv headers, roughly as follows:

    from lxml.html import parse

    doc = parse(url)
    nodes = doc.findall("//meta")
    for node in nodes:
        name = node.attrib.get('name')
        id = node.attrib.get('id')
        equiv = node.attrib.get('http-equiv')
        # attrib.get() returns None when the attribute is absent
        if equiv is not None and equiv.lower() == 'content-type':
            ... do your thing ... 

You can do a much fancier query to fetch the appropriate tag directly (by specifying the name= in the query), but in my case I'm parsing all meta tags. I'll leave that as an exercise for you; the relevant lxml documentation covers it, and there's a sketch below.
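For instance, one version of that fancier query filters on the attribute in the path itself (a sketch; note that unlike the translate() trick above, this match is case-sensitive):

    from lxml.html import parse

    doc = parse("http://example.com")
    # ElementPath does the filtering; an exact, case-sensitive match only
    for node in doc.findall('//meta[@http-equiv="Content-Type"]'):
        print(node.attrib.get('content'))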

BeautifulSoup is considered somewhat deprecated and is no longer actively developed.


Building your own HTML parser is much harder than you think, so like the previous answers I suggest using a library for it. But instead of BeautifulSoup or lxml I would suggest html5lib. It's the parser that best mimics how a browser parses the page, for instance with respect to encoding:

Parsed trees are always Unicode. However a large variety of input encodings are supported. The encoding of the document is determined in the following way:

The encoding may be explicitly specified by passing the name of the encoding as the encoding parameter to HTMLParser.parse

If no encoding is specified, the parser will attempt to detect the encoding from a <meta> element in the first 512 bytes of the document (this is only a partial implementation of the current HTML 5 specification)

If no encoding can be found and the chardet library is available, an attempt will be made to sniff the encoding from the byte pattern

If all else fails, the default encoding (usually Windows-1252) will be used

From: http://code.google.com/p/html5lib/wiki/UserDocumentation
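A minimal sketch of that flow, against the html5lib API the quoted docs describe (newer releases have since renamed the encoding parameter):

import urllib2
import html5lib

f = urllib2.urlopen("http://example.com")
parser = html5lib.HTMLParser()
tree = parser.parse(f)  # raw bytes in; encoding detected per the steps above
# or force it explicitly, as the quoted docs mention:
# tree = parser.parse(f, encoding="windows-1252")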
