xml.dom.minidom.parse() failing when XML attribute contains unicode

2023-03-04 13:21

I'm querying a web service using urllib2 and receiving XML. If I violate the web service's rate limit (1 call/second), I receive HTML back saying I've violated the rate limit.

Even though I call time.sleep() for 2-3 seconds after each call, I still violate the rate limit for some reason.

To test whether my response is XML or HTML, I'm using xml.dom.minidom.parseString() and then testing for the presence of an html element:

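import xml.dom.minidom
import xml.parsers.expat

# is_xml_response is a placeholder name for the enclosing function.
def is_xml_response(response_text):
    # Anything that fails to parse is treated as "not XML".
    try:
        dom = xml.dom.minidom.parseString(response_text)
    except xml.parsers.expat.ExpatError:
        return False

    # A well-formed document with an <html> element is the rate-limit page.
    return len(dom.getElementsByTagName('html')) == 0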

This gets the job done, but I've run into a case where one of the XML attributes contains a Unicode character reference. In that case, parsing fails with:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/python/default-2.6/lib/python2.6/xml/dom/minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "/opt/python/default-2.6/lib/python2.6/xml/dom/expatbuilder.py", line 924, in parse
    result = builder.parseFile(fp)
  File "/opt/python/default-2.6/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 3125

In this case, column 3125 is part of some attribute value text that contains ampersand-pound-x-9 (Stack Overflow is hiding my unicode).

Should xml.dom.minidom be able to handle this? Could there be another issue with the XML besides this that's causing the parsing to fail?

Additionally, I'm open to other ways of handling this type of situation if the community has one.

If it helps, here is what the web service returns when I've violated their rate limit:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="eng">
    <head>
        <title>Service Temporarily Unavailable - Rate Limited</title>
    </head> 
    <body style="text-align:center;background-color:white;"> 
        <h1>Service Temporarily Unavailable</h1>
        <hr />
        <div>
            You have used this service too often in a short time.  Please wait before using this service again.
            <br/><br/>
            Please visit the <a href="http://wiki.xxxx.com/index.php?title=API_Usage">wiki</a> for more details.
        </div> 
    </body> 
</html>

I think that &#x9; is a tab. You could try htmlentitydefs (http://docs.python.org/library/htmllib.html#module-htmlentitydefs) to convert HTML entities back to the characters they represent, although that may cause problems with entities such as &lt;. Alternatively, you could do a string substitution that replaces &#x9; with a space.
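A minimal sketch of the substitution idea, assuming the offending reference really is the tab (hex &#x9 or decimal &#9), with or without its closing semicolon. The helper name replace_tab_refs and the sample document are made up for illustration.

```python
import re
from xml.dom import minidom

# Matches the tab reference in hex (&#x9) or decimal (&#9) form, with an
# optional ';'. The lookaheads leave longer references such as &#x9A; or
# &#97; untouched, since those are different characters.
_TAB_REF = re.compile(r'&#x0*9(?![0-9A-Fa-f]);?|&#0*9(?![0-9]);?')

def replace_tab_refs(text, replacement=' '):
    """Replace tab character references with a plain replacement string."""
    return _TAB_REF.sub(replacement, text)

# Hypothetical response fragment with an unterminated reference.
broken = '<root><a href="&#x9 link">x</a></root>'
fixed = replace_tab_refs(broken)
dom = minidom.parseString(fixed)  # parses once the reference is gone
```

Note that an unterminated hex reference followed directly by a letter a-f (for example &#x9foo) is ambiguous and is deliberately left alone here.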

As a general suggestion: when the parser runs into a problem, such as input that doesn't fit the expected pattern, don't abort the whole operation. Emit a warning and carry on instead. That way you can see what the problem is and potentially correct it, or at least know that there is a problem.
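One way to approximate this with minidom is sketched below. Expat itself cannot resume after a well-formedness error, so it is the caller that continues: the error is logged with the text around the failing position and None is returned. The name parse_or_warn and the context width are illustrative assumptions; lineno and offset are attributes that ExpatError provides.

```python
import logging
import xml.dom.minidom
from xml.parsers.expat import ExpatError

log = logging.getLogger(__name__)

def parse_or_warn(text, context=30):
    """Return a DOM, or None after logging where parsing failed."""
    try:
        return xml.dom.minidom.parseString(text)
    except ExpatError as err:
        # err.lineno is 1-based; err.offset is the column within that line
        # (it may be a byte offset if the text contains non-ASCII characters).
        lines = text.splitlines() or ['']
        index = err.lineno - 1
        line = lines[index] if 0 <= index < len(lines) else ''
        start = max(err.offset - context, 0)
        snippet = line[start:err.offset + context]
        log.warning('XML parse failed (%s) near %r', err, snippet)
        return None
```

Seeing the snippet around column 3125 in the log would show directly whether the reference is missing its semicolon.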

As for your problem with the rate limit, why not cache each response once so you can do the processing locally?
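A rough sketch of such a cache, assuming responses can be keyed by URL and that a fetch callable (hypothetical here) performs the real request. In practice you would only store a response after confirming it is not the rate-limit page.

```python
_cache = {}

def cached_fetch(url, fetch):
    """Return the body for url, calling fetch(url) only on a cache miss."""
    if url not in _cache:
        _cache[url] = fetch(url)
    return _cache[url]

# Stand-in for the real network call, recording how often it is used.
calls = []

def fake_fetch(url):
    calls.append(url)
    return '<data/>'

first = cached_fetch('http://example.invalid/api?q=1', fake_fetch)
second = cached_fetch('http://example.invalid/api?q=1', fake_fetch)
```

The second call is served from the dictionary, so it does not count against the rate limit.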

You could also test the string for HTML before attempting to parse the result:

if response_text.lstrip().startswith('<!DOCTYPE html'):
    # we received an html response, sleep again
...
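The check above could be wrapped in a retry loop along these lines. This is only a sketch: fetch is a hypothetical zero-argument callable that performs the request, and the attempt count and delays are arbitrary assumptions, not values from the web service.

```python
import time

def looks_like_html(response_text):
    """True if the body starts like the rate-limit HTML page."""
    return response_text.lstrip().startswith('<!DOCTYPE html')

def fetch_xml(fetch, attempts=5, delay=2.0, sleep=time.sleep):
    """Call fetch() until it returns something other than the HTML page."""
    for attempt in range(attempts):
        response_text = fetch()
        if not looks_like_html(response_text):
            return response_text
        # Rate limited: wait a little longer after each failed attempt.
        sleep(delay * (attempt + 1))
    raise RuntimeError('still rate limited after %d attempts' % attempts)
```

Passing sleep as a parameter keeps the function easy to exercise without real waiting.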

I also couldn't get minidom to blow up on an attribute containing a tab entity. Perhaps it is an improperly terminated entity sequence, like &#9 without the ending semicolon? Minidom seems okay with properly-escaped entities inside attributes:

from xml.dom import minidom

text = '<root><a href="&#9;foo&lt;">link</a></root>'
tree = minidom.parseString(text)
print(tree.toxml())

u'<?xml version="1.0" ?>\n<root><a href="\tfoo&lt;">link</a></root>'
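To check the missing-semicolon theory, the two variants can be compared side by side. The sample documents below are made up; only the terminating semicolon differs between them.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def parses(text):
    """Return True if minidom accepts text, False on an ExpatError."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False

terminated = '<root><a href="&#9;foo">link</a></root>'
unterminated = '<root><a href="&#9foo">link</a></root>'

print(parses(terminated))    # True
print(parses(unterminated))  # False
```

If the real response fails in the same way as the unterminated variant, the web service is emitting malformed XML and the reference needs repairing before parsing.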
