Parsing HTML: lxml error in Python
I am writing a simple script to fetch the big grey table from here.
The code I have is the following:
import urllib2
from lxml import etree
html = urllib2.urlopen("http://www.afi.com/100years/movies10.aspx").read()
root = etree.XML(html)
But I am getting an error on the last statement:
Traceback (most recent call last):
  File "D:\Workspace\afi100\afi100.py", line 13, in <module>
    root = etree.XML(html)
  File "lxml.etree.pyx", line 2720, in lxml.etree.XML (src/lxml/lxml.etree.c:52577)
  File "parser.pxi", line 1556, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79602)
  File "parser.pxi", line 1435, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78449)
  File "parser.pxi", line 943, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75099)
  File "parser.pxi", line 547, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71467)
  File "parser.pxi", line 628, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72340)
  File "parser.pxi", line 568, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71683)
XMLSyntaxError: Space required after the Public Identifier, line 3, column 59
Any idea how I can get around this error?
Thanks.
You're trying to parse HTML with the XML parser. Use lxml's HTML parser instead:
import urllib2
from lxml import etree

# Fetch the page and let lxml's lenient HTML parser build the tree.
ufile = urllib2.urlopen("http://www.afi.com/100years/movies10.aspx")
root = etree.parse(ufile, etree.HTMLParser())
print etree.tostring(root)
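To see the difference without hitting the network, here is a minimal sketch that feeds the same markup to both parsers and then pulls the table cells out of the HTML tree. It is written for Python 3, unlike the Python 2 snippets above. SAMPLE, parse_rows and the film titles are made up for illustration; the real page's markup will differ, so treat the row extraction as a starting point rather than a finished scraper.

```python
from lxml import etree

# Hypothetical stand-in for the downloaded page: an HTML 4 DOCTYPE with no
# system identifier and an unclosed <br>, both of which are invalid XML.
SAMPLE = b"""<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><body>
<p>Intro text<br>with an unclosed line break</p>
<table>
<tr><td>1</td><td>First Film</td></tr>
<tr><td>2</td><td>Second Film</td></tr>
</table>
</body></html>"""

# The strict XML parser rejects the markup.
try:
    etree.XML(SAMPLE)
    xml_ok = True
except etree.XMLSyntaxError as exc:
    xml_ok = False
    print("XML parser failed: %s" % exc)


def parse_rows(markup):
    """Parse markup with the lenient HTML parser and return cell text per row."""
    root = etree.HTML(markup)
    return [
        [(td.text or "").strip() for td in tr.findall("td")]
        for tr in root.iter("tr")
    ]


rows = parse_rows(SAMPLE)
print(rows)
```

etree.HTML(data) is the string counterpart of etree.parse(file, etree.HTMLParser()), so either form should work once you have the page contents.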
The document you link to is not well-formed XHTML, so an XML parser cannot load it.
Use an HTML parser such as Beautiful Soup instead.
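A rough sketch of the Beautiful Soup route, assuming the bs4 package is installed and Python 3 is used. SAMPLE_PAGE and the film titles are invented for the example and only mimic the kind of markup that trips up an XML parser; the "html.parser" argument selects Python's built-in backend so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page (not the real AFI markup).
SAMPLE_PAGE = """<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><body>
<p>Intro text<br>with an unclosed line break</p>
<table>
<tr><td>1</td><td>First Film</td></tr>
<tr><td>2</td><td>Second Film</td></tr>
</table>
</body></html>"""

soup = BeautifulSoup(SAMPLE_PAGE, "html.parser")

# Collect the stripped text of every cell, one list per table row.
table_rows = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in soup.find_all("tr")
]
print(table_rows)
```

On the real page you would pass the downloaded HTML instead of SAMPLE_PAGE and probably narrow the search to the one table you care about.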