开发者

Parsing unicode XML with Python SAX on App Engine

I'm using xml.sax with unicode strings of XML as input, originally entered in from a web form. On my local machine (python 2.5, using the default xmlreader expat, running through app 开发者_StackOverflow社区engine), it works fine. However, the exact same code and input strings on production app engine servers fail with "not well-formed". For example, it happens with the code below:

from xml import sax
class MyHandler(sax.ContentHandler):
  pass

handler = MyHandler()
# Both of these unicode strings return 'not well-formed' 
# on app engine, but work locally
xml.parseString(u"<a>b</a>",handler) 
xml.parseString(u"<!DOCTYPE a[<!ELEMENT a (#PCDATA)> ]><a>b</a>",handler)

# Both of these work, but output unicode
xml.parseString("<a>b</a>",handler) 
xml.parseString("<!DOCTYPE a[<!ELEMENT a (#PCDATA)> ]><a>b</a>",handler)

resulting in the error:

  File "<string>", line 1, in <module>
  File "/base/python_dist/lib/python2.5/xml/sax/__init__.py", line 49, in parseString
    parser.parse(inpsrc)
  File "/base/python_dist/lib/python2.5/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/base/python_dist/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/base/python_dist/lib/python2.5/xml/sax/expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "/base/python_dist/lib/python2.5/xml/sax/handler.py", line 38, in fatalError
    raise exception
SAXParseException: <unknown>:1:1: not well-formed (invalid token)

Any reason why app engine's parser, which also uses python2.5 and expat, would fail when inputting unicode?


You are not supposed to parse a unicode string, you should parse a UTF-8 encoded string. A unicode string is not a well-formed XML by default, according to XML 1.0 specification. So you need to convert unicode to UTF-8 encoding before feeding it to the parser.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜