Parsing unicode XML with Python SAX on App Engine
I'm using xml.sax with unicode strings of XML as input, originally entered through a web form. On my local machine (Python 2.5, using the default xmlreader, expat, running through App Engine), it works fine. However, the exact same code and input strings fail on the production App Engine servers with "not well-formed". For example, it happens with the code below:
from xml import sax

class MyHandler(sax.ContentHandler):
    pass

handler = MyHandler()

# Both of these unicode strings raise 'not well-formed'
# on App Engine, but work locally
sax.parseString(u"<a>b</a>", handler)
sax.parseString(u"<!DOCTYPE a[<!ELEMENT a (#PCDATA)> ]><a>b</a>", handler)

# Both of these byte strings work, but the handler receives unicode
sax.parseString("<a>b</a>", handler)
sax.parseString("<!DOCTYPE a[<!ELEMENT a (#PCDATA)> ]><a>b</a>", handler)
resulting in the error:
  File "<string>", line 1, in <module>
  File "/base/python_dist/lib/python2.5/xml/sax/__init__.py", line 49, in parseString
    parser.parse(inpsrc)
  File "/base/python_dist/lib/python2.5/xml/sax/expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/base/python_dist/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/base/python_dist/lib/python2.5/xml/sax/expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "/base/python_dist/lib/python2.5/xml/sax/handler.py", line 38, in fatalError
    raise exception
SAXParseException: <unknown>:1:1: not well-formed (invalid token)
Is there any reason why App Engine's parser, which also uses Python 2.5 and expat, would fail when given unicode input?
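You are not supposed to hand the parser a unicode string; you should hand it an encoded byte string, such as UTF-8. The XML 1.0 specification describes a document's encoding in terms of bytes: a parser must accept UTF-8 and UTF-16, and assumes UTF-8 when there is no byte order mark or encoding declaration. A Python unicode object has no fixed byte representation, so what expat actually receives can differ between environments. That would be consistent with the same input working locally and failing in production with "invalid token" at line 1, column 1. So encode the unicode string to UTF-8 before feeding it to the parser. As your byte-string examples show, the handler still receives unicode text in its callbacks.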
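As a minimal sketch of the encoding step described above (the TextCollector handler and the parse_unicode_xml helper are hypothetical names made up for illustration, not part of xml.sax), something like this should behave the same locally and in production:

```python
from xml import sax


class TextCollector(sax.ContentHandler):
    """Hypothetical handler that accumulates character data."""

    def __init__(self):
        sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so collect them all.
        self.chunks.append(content)


def parse_unicode_xml(text, handler):
    """Hypothetical helper: encode unicode text to UTF-8 bytes, then parse."""
    # Assumption: the text has no XML declaration naming a different
    # encoding, so the parser's UTF-8 default matches the bytes we pass.
    data = text.encode("utf-8")
    sax.parseString(data, handler)


handler = TextCollector()
parse_unicode_xml(u"<a>b\u00e9</a>", handler)
result = u"".join(handler.chunks)
```

If the form input can start with something like <?xml version="1.0" encoding="ISO-8859-1"?>, the declared encoding would no longer match the UTF-8 bytes, so you would need to encode to the declared encoding instead, or strip or rewrite the declaration first.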