How to get `http-equiv`s in python?
I am using urllib2.urlopen to fetch a URL and read header information like 'charset' and 'content-length'.
But some pages set their charset with something like

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

and urllib2 doesn't parse this for me.
Is there any built-in tool I can use to get http-equiv information?
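For context, this is roughly all the response headers alone give me (a minimal sketch; the URL is a placeholder):

    import urllib2

    f = urllib2.urlopen("http://example.com")
    # Headers sent by the server -- no help when the charset
    # only appears inside the HTML itself.
    print(f.info().getheader('content-type'))    # e.g. 'text/html; charset=utf-8'
    print(f.info().getheader('content-length'))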
EDIT:
This is what I do to parse the charset from a page:
    import lxml.html

    def get_charset(page_source):
        elem = lxml.html.fromstring(page_source)
        content_type = elem.xpath(
            ".//meta[translate(@http-equiv, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
            "'abcdefghijklmnopqrstuvwxyz')='content-type']/@content")
        if content_type:
            content_type = content_type[0]
            for frag in content_type.split(';'):
                frag = frag.strip().lower()
                i = frag.find('charset=')
                if i > -1:
                    return frag[i + 8:]  # 8 == len('charset=')
        return None
How can I improve this? Can I precompile the xpath query?
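Regarding precompiling: lxml's etree.XPath class compiles an expression once so it can be reused across pages. A minimal sketch with the same expression as above:

    from lxml import etree

    # compiled once at import time, reusable for every page
    _CONTENT_TYPE = etree.XPath(
        ".//meta[translate(@http-equiv, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
        "'abcdefghijklmnopqrstuvwxyz')='content-type']/@content")

    content_type = _CONTENT_TYPE(elem)  # same result as elem.xpath(...)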
Find 'http-equiv' using BeautifulSoup
    import urllib2
    from BeautifulSoup import BeautifulSoup

    f = urllib2.urlopen("http://example.com")
    soup = BeautifulSoup(f)  # trust BeautifulSoup to detect the encoding
    for meta in soup.findAll('meta', attrs={
            'http-equiv': lambda x: x and x.lower() == 'content-type'}):
        print("content-type: %r" % meta['content'])
        break
    else:
        print('no content-type found')

    # NOTE: strings in the soup are Unicode, but we can ask about the charset
    # declared in the html
    print("encoding: %s" % (soup.declaredHTMLEncoding,))
Yeah, any HTML parsing library would help.
BeautifulSoup is a pure-Python library based on sgmllib; lxml is a more efficient alternative written in C.
Try either of them; they will solve your problem.
I need to parse this as well (among other things) for my online HTTP fetcher. I use lxml to parse pages and get the meta http-equiv headers, roughly as follows:
    from lxml.html import parse

    doc = parse(url)
    nodes = doc.findall("//meta")
    for node in nodes:
        name = node.attrib.get('name')
        id = node.attrib.get('id')
        equiv = node.attrib.get('http-equiv')
        # guard against meta tags that have no http-equiv attribute at all
        if equiv and equiv.lower() == 'content-type':
            pass  # ... do your thing ...
You can do a much fancier query to directly fetch the appropriate tag (by specifying the name= in the query), but in my case I'm parsing all meta tags. I'll leave this as an exercise for you, here is the relevant lxml documentation.
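For reference, the direct form might look something like this (a sketch using an ElementPath attribute predicate; unlike the translate() trick above, it only matches this exact capitalization):

    # fetch only the matching meta tags directly
    nodes = doc.findall("//meta[@http-equiv='Content-Type']")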
BeautifulSoup is considered somewhat deprecated and is no longer actively developed.
Building your own HTML parser is much harder than you think, so like the previous answers I suggest using a library. But instead of BeautifulSoup or lxml I would suggest html5lib. It is the parser that best mimics how a browser parses the page, for instance with respect to encoding:
Parsed trees are always Unicode. However a large variety of input encodings are supported. The encoding of the document is determined in the following way:
The encoding may be explicitly specified by passing the name of the encoding as the encoding parameter to HTMLParser.parse
If no encoding is specified, the parser will attempt to detect the encoding from a <meta> element in the first 512 bytes of the document (this is only a partial implementation of the current HTML 5 specification)
If no encoding can be found and the chardet library is available, an attempt will be made to sniff the encoding from the byte pattern
If all else fails, the default encoding (usually Windows-1252) will be used
From: http://code.google.com/p/html5lib/wiki/UserDocumentation
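A minimal sketch of how that detection surfaces in code (assuming html5lib is installed; documentEncoding reports whichever of the steps above succeeded):

    import urllib2
    import html5lib

    raw = urllib2.urlopen("http://example.com").read()
    parser = html5lib.HTMLParser()
    doc = parser.parse(raw)  # bytes in, Unicode tree out
    print("encoding: %s" % parser.documentEncoding)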