开发者

How to parse unicode strings with minidom?

I'm trying to parse a bunch of xml files with the library xml.dom.minidom, to extract some data and put it in a text file. Most of the XMLs go well, but for some of them I get the following error when calling minidom.parsestring():

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5189: ordinal not in range(128)

It happens for some other non-ascii characters too. My question is: what are my options here? Am I supposed to somehow s开发者_运维知识库trip/replace all those non-English characters before being able to parse the XML files?


Try to decode it:

> print u'abcdé'.encode('utf-8')
> abcdé

> print u'abcdé'.encode('utf-8').decode('utf-8')
> abcdé


In case your string is 'str':

xmldoc = minidom.parseString(u'{0}'.format(str).encode('utf-8'))

This worked for me.


Minidom doesn't directly support parsing Unicode strings; it's something that has historically had poor support and standardisation. Many XML tools recognise only byte streams as something an XML parser can consume.

If you have plain files, you should either read them in as byte strings (not Unicode!) and pass that to parseString(), or just use parse() which will read a file directly.


I know the O.P. asked about parsing strings, but I had the same exception upon writing the DOM model to a file via Document.writexml(...). In case people with that (related) problem land here, I will offer my solution.

My code which was throwing the UnicodeEncodeError looked like:

with tempfile.NamedTemporaryFile(delete=False) as fh:
    dom.writexml(fh, encoding="utf-8")

Note that the "encoding" param only effects the XML header and has no effect on the treatment of the data. To fix it, I changed it to:

with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh = codecs.lookup("utf-8")[3](fh)
    dom.writexml(fh, encoding="utf-8")

This will wrap the file handle with an instance of encodings.utf_8.StreamWriter, which handles the data as UTF-8 rather then ASCII, and the UnicodeEncodeError went away. I got the idea from reading the source of xml.dom.minidom.Node.toprettyxml(...).


I encounter this error a few times, and my hacky way of dealing with it is just to do this:

def getCleanString(word):   
   str = ""
   for character in word:
      try: 
         str_character = str(character)
         str = str + str_character
      except:
         dummy = 1 # this happens if character is unicode
   return str

Of course, this is probably a dumb way of doing it, but it gets the job done for me, and doesn't cost me anything in speed.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜