Python's libxml2 can't parse unicode strings
OK, the docs for Python's libxml2 bindings are really ****. My problem:
An XML document is stored in a string variable in Python. The string is an instance of Unicode, and there are non-ASCII characters in it. I want to parse it with libxml2, with code looking something like this:
# -*- coding: utf-8 -*-
import libxml2
DOC = u"""<?xml version="1.0" encoding="UTF-8"?>
<data>
<something>Bäääh!</something>
</data>
"""
xml_doc = libxml2.parseDoc(DOC)
with this result:
Traceback (most recent call last):
File "test.py", line 13, in <module>
xml_doc = libxml2.parseDoc(DOC)
File "c:\Python26\lib\site-packages\libxml2.py", line 1237, in parseDoc
ret = libxml2mod.xmlParseDoc(cur)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 46-48:
ordinal not in range(128)
The point is the u"..." declaration. If I replace it with a simple "..", then everything is ok. Unfortunately that doesn't work in my setup, because DOC will definitely be a Unicode instance.
Does anyone have an idea how to get libxml2 to parse UTF-8 encoded strings?
It should be
# -*- coding: utf-8 -*-
import libxml2
DOC = u"""<?xml version="1.0" encoding="UTF-8"?>
<data>
<something>Bäääh!</something>
</data>
""".encode("UTF-8")
xml_doc = libxml2.parseDoc(DOC)
The .encode("UTF-8") is needed to get the binary representation of the Unicode string in the UTF-8 encoding.
XML is a binary format, despite looking like text. An encoding is specified at the beginning of the XML file so that the XML bytes can be decoded into text.
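To make the bytes-versus-text point concrete, here is a small sketch. It assumes the standard library's xml.etree.ElementTree as a stand-in parser (not libxml2, which may not be installed), and the helper name ascii_encoding_fails is made up for illustration. It shows that the text cannot be encoded as ASCII, which is what the traceback in the question reports, while explicit UTF-8 encoding yields bytes that match the encoding declaration and can be parsed.

```python
# -*- coding: utf-8 -*-
# Assumption: ElementTree stands in for libxml2 so the sketch runs anywhere.
import xml.etree.ElementTree as ET

# \u00e4 is the letter a-umlaut, written as an escape to keep the source ASCII.
DOC = u"""<?xml version="1.0" encoding="UTF-8"?>
<data>
<something>B\u00e4\u00e4\u00e4h!</something>
</data>
"""


def ascii_encoding_fails(text):
    """Hypothetical helper: True if text contains characters outside ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


# An implicit ASCII conversion is what breaks in the question's traceback.
needs_real_encoding = ascii_encoding_fails(DOC)

# Encoding explicitly produces bytes that agree with encoding="UTF-8".
raw = DOC.encode("UTF-8")

# The parser reads the declaration and decodes the bytes back into text.
root = ET.fromstring(raw)
value = root.find("something").text
```

Each of the three umlauts becomes two bytes in UTF-8, so raw is three bytes longer than DOC has characters.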
What you should do is pass str, not unicode, to your library:
xml_doc = libxml2.parseDoc(DOC.encode("UTF-8"))
(Although some tricks are possible with site.setencoding if you are interested in reading or writing unicode strings with automatic conversion via locale.)
Edit: The Unicode article by Joel Spolsky is a good guide to string characters vs. bytes, encodings, etc.
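If DOC can arrive either as text or as already-encoded bytes, one way to keep the call site simple is a small wrapper that encodes only when needed. This is a sketch under assumptions: the names as_utf8_bytes and parse_text are hypothetical, and the parser is passed in as an argument so that libxml2.parseDoc (or any other function that takes bytes) can be supplied.

```python
def as_utf8_bytes(doc):
    """Return doc as UTF-8 bytes; byte strings pass through unchanged.

    Assumption: bytes passed in are already in the encoding that the
    XML declaration names.
    """
    if isinstance(doc, bytes):
        return doc
    return doc.encode("UTF-8")


def parse_text(doc, parse):
    """Hypothetical wrapper: hand the parser bytes, never a unicode object."""
    return parse(as_utf8_bytes(doc))


# Intended use with the bindings from the question (not run here):
#   xml_doc = parse_text(DOC, libxml2.parseDoc)
```

Note that this only helps when the document declares UTF-8 (or declares nothing); a document declaring another encoding would have to be encoded to match its declaration.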