How to parse unicode strings with minidom?

2023-02-17 01:42 问答作者：

I'm trying to parse a bunch of xml files with the library xml.dom.minidom, to extract some data and put it in a text file. Most of the XMLs go well, but for some of them I get the following error when calling minidom.parsestring():

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5189: ordinal not in range(128)

It happens for some other non-ascii characters too. My question is: what are my options here? Am I supposed to somehow s开发者_运维知识库trip/replace all those non-English characters before being able to parse the XML files?

Try to decode it:

> print u'abcdé'.encode('utf-8')
> abcdÃ©

> print u'abcdé'.encode('utf-8').decode('utf-8')
> abcdé

In case your string is 'str':

xmldoc = minidom.parseString(u'{0}'.format(str).encode('utf-8'))

This worked for me.

Minidom doesn't directly support parsing Unicode strings; it's something that has historically had poor support and standardisation. Many XML tools recognise only byte streams as something an XML parser can consume.

If you have plain files, you should either read them in as byte strings (not Unicode!) and pass that to parseString(), or just use parse() which will read a file directly.

I know the O.P. asked about parsing strings, but I had the same exception upon writing the DOM model to a file via Document.writexml(...). In case people with that (related) problem land here, I will offer my solution.

My code which was throwing the UnicodeEncodeError looked like:

with tempfile.NamedTemporaryFile(delete=False) as fh:
    dom.writexml(fh, encoding="utf-8")
Note that the "encoding" param only effects the XML header and has no effect on the treatment of the data. To fix it, I changed it to:
with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh = codecs.lookup("utf-8")[3](fh)
    dom.writexml(fh, encoding="utf-8")

This will wrap the file handle with an instance of encodings.utf_8.StreamWriter, which handles the data as UTF-8 rather then ASCII, and the UnicodeEncodeError went away. I got the idea from reading the source of xml.dom.minidom.Node.toprettyxml(...).

I encounter this error a few times, and my hacky way of dealing with it is just to do this:

def getCleanString(word):   
   str = ""
   for character in word:
      try: 
         str_character = str(character)
         str = str + str_character
      except:
         dummy = 1 # this happens if character is unicode
   return str

Of course, this is probably a dumb way of doing it, but it gets the job done for me, and doesn't cost me anything in speed.

继续阅读：minidom python unicode

How to parse unicode strings with minidom?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？