How to get `http-equiv`s in python?
I am using urllib2.urlopen to fetch a URL and read header information like 'charset' and 'content-length'.
But some pages set their charset with something like

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

and urllib2 doesn't parse this for me.
Is there any built-in tool I can use to get http-equiv information?
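For context, this is roughly all the response headers alone give me (a minimal sketch; the URL is a placeholder):

    import urllib2

    f = urllib2.urlopen("http://example.com")
    # Headers sent by the server -- no help when the charset
    # only appears inside the HTML itself.
    print(f.info().getheader('content-type'))    # e.g. 'text/html; charset=utf-8'
    print(f.info().getheader('content-length'))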
EDIT:
This is what I do to parse the charset from a page:
    import lxml.html

    def get_charset(page_source):
        elem = lxml.html.fromstring(page_source)
        content_type = elem.xpath(
            ".//meta[translate(@http-equiv, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
            "'abcdefghijklmnopqrstuvwxyz')='content-type']/@content")
        if content_type:
            content_type = content_type[0]
            for frag in content_type.split(';'):
                frag = frag.strip().lower()
                i = frag.find('charset=')
                if i > -1:
                    return frag[i + 8:]  # 8 == len('charset=')
        return None
How can I improve this? Can I precompile the xpath query?
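Regarding precompiling: lxml's etree.XPath class compiles an expression once so it can be reused across pages. A minimal sketch with the same expression as above:

    from lxml import etree

    # compiled once at import time, reusable for every page
    _CONTENT_TYPE = etree.XPath(
        ".//meta[translate(@http-equiv, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
        "'abcdefghijklmnopqrstuvwxyz')='content-type']/@content")

    content_type = _CONTENT_TYPE(elem)  # same result as elem.xpath(...)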
Find 'http-equiv' using BeautifulSoup
    import urllib2
    from BeautifulSoup import BeautifulSoup

    f = urllib2.urlopen("http://example.com")
    soup = BeautifulSoup(f)  # trust BeautifulSoup to detect the encoding
    for meta in soup.findAll('meta', attrs={
            'http-equiv': lambda x: x and x.lower() == 'content-type'}):
        print("content-type: %r" % meta['content'])
        break
    else:
        print('no content-type found')

    # NOTE: strings in the soup are Unicode, but we can ask about the charset
    # declared in the html
    print("encoding: %s" % (soup.declaredHTMLEncoding,))
Yeah, any HTML parsing library would help.
BeautifulSoup is a pure-Python library based on sgmllib; lxml is a more efficient alternative written in C.
Try either of them; they will solve your problem.
I need to parse this as well (among other things) for my online HTTP fetcher. I use lxml to parse pages and get the meta http-equiv headers, roughly as follows:
    from lxml.html import parse

    doc = parse(url)
    nodes = doc.findall("//meta")
    for node in nodes:
        name = node.attrib.get('name')
        id = node.attrib.get('id')
        equiv = node.attrib.get('http-equiv')
        # guard against meta tags that have no http-equiv attribute at all
        if equiv and equiv.lower() == 'content-type':
            pass  # ... do your thing ...
You can do a much fancier query to directly fetch the appropriate tag (by specifying the name= in the query), but in my case I'm parsing all meta tags. I'll leave this as an exercise for you, here is the relevant lxml documentation.
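For reference, the direct form might look something like this (a sketch using an ElementPath attribute predicate; unlike the translate() trick above, it only matches this exact capitalization):

    # fetch only the matching meta tags directly
    nodes = doc.findall("//meta[@http-equiv='Content-Type']")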
BeautifulSoup is considered somewhat deprecated and is no longer actively developed.
Building your own HTML parser is much harder than you think, so like the previous answers I suggest using a library. But instead of BeautifulSoup or lxml I would suggest html5lib. It is the parser that best mimics how a browser parses the page, for instance with respect to encoding:
Parsed trees are always Unicode. However a large variety of input encodings are supported. The encoding of the document is determined in the following way:
The encoding may be explicitly specified by passing the name of the encoding as the encoding parameter to HTMLParser.parse
If no encoding is specified, the parser will attempt to detect the encoding from a <meta> element in the first 512 bytes of the document (this is only a partial implementation of the current HTML 5 specification)
If no encoding can be found and the chardet library is available, an attempt will be made to sniff the encoding from the byte pattern
If all else fails, the default encoding (usually Windows-1252) will be used
From: http://code.google.com/p/html5lib/wiki/UserDocumentation
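A minimal sketch of how that detection surfaces in code (assuming html5lib is installed; documentEncoding reports whichever of the steps above succeeded):

    import urllib2
    import html5lib

    raw = urllib2.urlopen("http://example.com").read()
    parser = html5lib.HTMLParser()
    doc = parser.parse(raw)  # bytes in, Unicode tree out
    print("encoding: %s" % parser.documentEncoding)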