Using lxml, what causes a "lxml.etree.XMLSyntaxError: Document is empty" error?
I'm using mechanize/cookiejar/lxml to read a page, and it works for some pages but not others. The error I get on the failing ones is the one in the title. I can't post the pages here because they aren't SFW, but is there a way to fix it? Basically, this is what I do:
import mechanize, cookielib
from lxml import etree
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(False)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 maverick Firefox/3.6.13')]
response = br.open('...')
tree = etree.parse(response) #error
After that I get the root and search the document for the values I want. Apparently iterparse doesn't crash it, but at the moment I'm assuming it doesn't just because I didn't process anything with it. Plus, I haven't figured out yet how to search for the stuff with it.
I've tried disabling gzip and enabling sending the referer as well, but neither solves the problem. I also tried saving the source code to disk and creating the tree from that file, just for the sake of it, and I get the same error.
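On searching with iterparse: a minimal sketch, assuming a well-formed XML input. The sample bytes, the `item` tag, and the helper name `collect_item_text` are made up for illustration. As far as I can tell, iterparse only parses while you iterate over it, which would explain why simply creating it does not raise anything.

```python
import io
from lxml import etree

# Stand-in for the downloaded page; the element names are hypothetical.
xml_bytes = b"<root><item>first</item><item>second</item></root>"

def collect_item_text(source):
    """Return the text of every <item> element found while streaming."""
    texts = []
    # Events are produced as the loop advances, so syntax errors surface here.
    for _event, elem in etree.iterparse(source, events=("end",), tag="item"):
        texts.append(elem.text)
        elem.clear()  # release the element once it has been handled
    return texts

item_texts = collect_item_text(io.BytesIO(xml_bytes))
```

The `tag` argument restricts the events to the elements of interest, so the search happens inside the loop rather than on a finished tree.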
Edit:
The response I get seems to be fine. Using print repr(response) as suggested, I get <response_seek_wrapper at 0xa4a160c whose wrapped object = <stupid_gzip_wrapper at 0xa49acec whose fp = <socket._fileobject object at 0xa49c32c>>>. I can also save the response using the read() method and check that the saved .xml works in the browser and everything.
Also, in one of the pages, there is a &rsquo; entity that gives me the following error: "lxml.etree.XMLSyntaxError: Entity 'rsquo' not defined, line 17, column 7054". So far I've replaced it with a regex, but is there a parser that can handle this? I've gotten this error even with the lxml.html.parse suggested below.
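For reference, a small sketch of how the two parsers treat that entity, using a made-up one-line snippet rather than the real page. The XML parser knows only the five predefined XML entities unless a DTD declares more, while the HTML parser has the HTML entity table built in. If lxml.html still raises on the real page, something other than this snippet's situation is going on.

```python
from lxml import etree
import lxml.html

# Hypothetical fragment containing the HTML-only entity from the error message.
snippet = "<p>It&rsquo;s here</p>"

# The strict XML parser has no definition for &rsquo; and rejects the input.
try:
    etree.fromstring(snippet)
    xml_error = None
except etree.XMLSyntaxError as exc:
    xml_error = str(exc)

# The HTML parser resolves the entity to the right single quotation mark.
paragraph = lxml.html.fromstring(snippet)
html_text = paragraph.text_content()
```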
Regarding the file being highlighted, I meant that when I open it with gEdit it does this kinda: http://img34.imageshack.us/img34/9574/gedit.jpg
Use lxml.html.parse for HTML; it can handle even very broken HTML. Do you still get an error then?
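A minimal sketch of that suggestion, assuming the page is HTML rather than strict XML. The markup and the `price` class are invented for the example; in the question's code, the mechanize response object would take the place of the BytesIO stream.

```python
import io
import lxml.html

# Deliberately sloppy markup: an unclosed <p> and a missing </html>.
html_bytes = b"<html><body><div class='price'>42</div><p>unclosed<br></body>"

# lxml.html.parse accepts a file-like object, just as etree.parse does.
tree = lxml.html.parse(io.BytesIO(html_bytes))
root = tree.getroot()

# Search the recovered tree with XPath once parsing has succeeded.
prices = root.xpath("//div[@class='price']/text()")
```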
What is the nature of response? According to the help, etree.parse is expecting one of:
- a file name/path
- a file object
- a file-like object
- a URL using the HTTP or FTP protocol
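One way a file-like object can yield an empty document is if it has already been read to the end before being handed to etree.parse. The sketch below shows that with an in-memory stream; it is only one possible cause, not a diagnosis of the pages in the question, and the sample XML is invented.

```python
import io
from lxml import etree

stream = io.BytesIO(b"<root><child/></root>")
stream.read()  # consume the stream, like calling response.read() first

# The parser now sees zero bytes and rejects the input.
try:
    etree.parse(stream)
    exhausted_failed = False
except etree.XMLSyntaxError:
    exhausted_failed = True

# Rewinding makes the same stream parseable again.
stream.seek(0)
root_tag = etree.parse(stream).getroot().tag
```

If the response object supports seek(0), rewinding it before parsing is a cheap thing to rule out.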