xml.dom.minidom.parse() failing when XML attribute contains unicode

I'm querying a web service using urllib2.request and receiving XML. If I violate the web service's rate limit (1 call/second), I receive HTML back saying I've violated the rate limit.

Even though I can time.sleep() for 2-3 seconds after each call, I still, for whatever reason, violate the rate limit.

To test whether my response is XML or HTML, I'm using xml.dom.minidom.parseString() and then testing for the presence of an html element:

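import xml.dom.minidom
import xml.parsers.expat

def is_xml_response(response_text):
    # Anything that is not well-formed XML is rejected outright.
    try:
        dom = xml.dom.minidom.parseString(response_text)
    except xml.parsers.expat.ExpatError:
        return False

    # The rate-limit page is well-formed XHTML, so also check for an html element.
    return len(dom.getElementsByTagName('html')) == 0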

This gets the job done, but I've run into a case where one of the XML attributes contains unicode. In that case, the parse fails with:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/python/default-2.6/lib/python2.6/xml/dom/minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "/opt/python/default-2.6/lib/python2.6/xml/dom/expatbuilder.py", line 924, in parse
    result = builder.parseFile(fp)
  File "/opt/python/default-2.6/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 3125

In this case, column 3125 is part of some attribute value text that contains ampersand-pound-x-9 (Stackoverflow is hiding my unicode).

Should xml.dom.minidom be able to handle this? Could there be another issue with the XML besides this that's causing the parsing to fail?

Additionally, I'm open to other ways of handling this type of situation if the community has one.

If it helps, here is what the web service returns when I've violated their rate limit:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="eng">
    <head>
        <title>Service Temporarily Unavailable - Rate Limited</title>
    </head> 
    <body style="text-align:center;background-color:white;"> 
        <h1>Service Temporarily Unavailable</h1>
        <hr />
        <div>
            You have used this service too often in a short time.  Please wait before using this service again.
            <br/><br/>
            Please visit the <a href="http://wiki.xxxx.com/index.php?title=API_Usage">wiki</a> for more details.
        </div> 
    </body> 
</html>


I think that &#x9 is a tab. You could try http://docs.python.org/library/htmllib.html#module-htmlentitydefs to convert special HTML entities back to the characters they represent (though that may cause problems with &lt; etc.). Or you could do a string substitution that replaces &#x9 with a space.
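The string substitution could look something like the sketch below. It assumes the only troublesome reference is the tab (decimal 9 or hex x9), with or without its trailing semicolon; the helper name replace_tab_refs is made up for illustration.

```python
import re

# Matches a numeric character reference for tab (&#9 or &#x9), with or
# without the closing semicolon. The lookahead avoids touching longer
# references such as &#x90; or &#91;.
TAB_REF = re.compile(r'&#(?:x0*9|0*9)(?![0-9A-Fa-f]);?')


def replace_tab_refs(text):
    """Replace every tab character reference in text with a single space."""
    return TAB_REF.sub(' ', text)


cleaned = replace_tab_refs('<root><a href="x&#x9"/></root>')
print(cleaned)
```

One caveat: an unterminated &#x9 followed directly by a hex letter (for example &#x9foo) is ambiguous and is deliberately left alone here.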

Just as a suggestion, when you're parsing stuff, and the parser runs into a problem, such as not fitting your pattern, instead of stopping the operation, you should allow the parser to continue, but spit out a warning. This way you can see what the problem is, and potentially correct it, or at least see that there's a problem.
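As a rough sketch of that idea with minidom: expat itself cannot resume after a well-formedness error, but the calling code can log a warning with the text around the failure and carry on. ExpatError carries lineno and offset attributes; the function names parse_or_warn and error_context are hypothetical.

```python
import logging
import xml.parsers.expat
from xml.dom import minidom

log = logging.getLogger(__name__)


def error_context(text, err, width=20):
    """Return the text surrounding the position reported by an ExpatError."""
    lines = text.splitlines()
    if not 1 <= err.lineno <= len(lines):
        return ''
    line = lines[err.lineno - 1]          # lineno is 1-based
    start = max(0, err.offset - width)    # offset is a 0-based column
    return line[start:err.offset + width]


def parse_or_warn(text):
    """Parse text with minidom; on failure log a warning and return None."""
    try:
        return minidom.parseString(text)
    except xml.parsers.expat.ExpatError as err:
        log.warning('XML parse failed (%s) near: %r',
                    err, error_context(text, err))
        return None
```

With a failure at "line 1, column 3125", the logged snippet would show exactly which characters the parser choked on.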

Also, as to your problem with the rate limit, why not cache each response once so you can do the processing locally?
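A minimal in-memory version of such a cache might look like this. It assumes you already have some fetch(url) callable that returns the response text; both that callable and the is_cacheable hook are placeholders, not part of any library.

```python
def make_cached_fetch(fetch, is_cacheable=lambda text: True):
    """Wrap fetch(url) so each URL is requested from the service at most once.

    is_cacheable lets the caller refuse to store bad responses, such as
    the rate-limit page, so they get retried on the next call.
    """
    cache = {}

    def cached_fetch(url):
        if url in cache:
            return cache[url]
        text = fetch(url)
        if is_cacheable(text):
            cache[url] = text
        return text

    return cached_fetch
```

Repeated processing runs then hit the dictionary rather than the web service, which also keeps you under the rate limit.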


You could also test the string for HTML before attempting to parse the result:

if response_text.lstrip().startswith('<!DOCTYPE html'):
    # we received an html response, sleep again
...
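Fleshed out a little, the check-then-sleep idea could be sketched as follows. The fetch callable, the attempt count and the delays are all assumptions; plug in whatever actually makes your request.

```python
import time


def looks_like_html(response_text):
    """Cheap test for the rate-limit page without running an XML parser."""
    head = response_text.lstrip()[:100].lower()
    return head.startswith('<!doctype html') or head.startswith('<html')


def fetch_with_retry(fetch, max_attempts=5, delay=2.0, sleep=time.sleep):
    """Call fetch() until it returns something that is not HTML.

    Waits a little longer after each rate-limited response and returns
    None if every attempt came back as HTML.
    """
    for attempt in range(max_attempts):
        text = fetch()
        if not looks_like_html(text):
            return text
        sleep(delay * (attempt + 1))
    return None
```

The sleep parameter is only there so the waiting can be swapped out when testing.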

I also couldn't get minidom to blow up on an attribute containing a tab entity. Perhaps it is an improperly terminated entity sequence, like &#9 without the ending semicolon? Minidom seems okay with properly-escaped entities inside attributes:

from xml.dom import minidom

text = '<root><a href="&#9;foo&lt;">link</a></root>'
tree = minidom.parseString(text)
print tree.toxml()

u'<?xml version="1.0" ?>\n<root><a href="\tfoo&lt;">link</a></root>'
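To check the unterminated-entity theory directly, here is a small comparison (the is_well_formed helper is just for illustration). The two inputs differ only in the semicolon after the character reference.

```python
import xml.parsers.expat
from xml.dom import minidom


def is_well_formed(text):
    """Return True if minidom can parse text, False on an ExpatError."""
    try:
        minidom.parseString(text)
    except xml.parsers.expat.ExpatError:
        return False
    return True


terminated = '<root><a href="&#9;foo">link</a></root>'
unterminated = '<root><a href="&#9foo">link</a></root>'

print(is_well_formed(terminated))    # True
print(is_well_formed(unterminated))  # False
```

If your failing document behaves like the second string, the fault is in the XML the service sends rather than in minidom.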