Preserving Escaped Characters in Python XML Parsing
I'm trying to write a Python script that takes one or two XML files as input and writes one or two new files based on their contents. I was writing the script with the minidom module. However, the input files contain many instances of the escape sequence &#10; inside node attributes. In the output files, these sequences have been converted to different characters, which appear to be literal newlines.
For example, a line in the input file such as:
<Entry text="For English For Hearing Impaired&#10;Press 3 on Keypad"
Would be output as
<Entry text="For English For Hearing Impaired
Press 3 on Keypad"
I read that minidom causes this because it doesn't allow escape characters in XML attributes (I think). Is this true? And if so, what's the best tool or method for parsing an XML file into a Python document, manipulating nodes and exchanging them with other documents, and writing the documents back out to new files?
If it helps, I was also parsing and saving these files using 'utf-8' encoding. I don't know if this is part of the problem or not. Thanks for any help anyone can give.
-Alex Kaiser
I haven't used Python's standard xml modules since discovering lxml. It can do everything you're looking for. For example...
input.xml:
<?xml version="1.0" encoding='utf-8'?>
<root>
<Button3 yposition="250" fontsize="16" language1="For English For Hearing Impaired&#10;Press 3 on Keypad" />
</root>
and:
>>> from lxml import etree
>>> with open('input.xml', 'rb') as f:  # binary mode, so lxml reads the declared encoding itself
...     root = etree.parse(f)
...
>>> buttons = root.xpath('//Button3')
>>> buttons
[<Element Button3 at 101071f18>]
>>> buttons[0]
<Element Button3 at 101071f18>
>>> buttons[0].attrib
{'yposition': '250', 'language1': 'For English For Hearing Impaired\nPress 3 on Keypad', 'fontsize': '16'}
>>> buttons[0].attrib['foo'] = 'bar'
>>> s = etree.tostring(root, xml_declaration=True, encoding='utf-8', pretty_print=True)
>>> print(s.decode('utf-8'))
<?xml version='1.0' encoding='utf-8'?>
<root>
<Button3 yposition="250" fontsize="16" language1="For English For Hearing Impaired&#10;Press 3 on Keypad" foo="bar"/>
</root>
>>> with open('output.xml', 'wb') as f:  # tostring() returned bytes, so write in binary mode
...     f.write(s)
>>>
Unfortunately, the standard xml module doesn't have an option to turn off escaping. So, for me, the best option was to escape the text back using the function from ElementTree that xml itself uses for this purpose (the escape function from xml.sax.saxutils doesn't escape \n):
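from xml.etree import ElementTree
# _escape_attrib is a private helper. In Python 3 it takes only the text;
# the Python 2 version also required an encoding, as in _escape_attrib(text, 'utf-8').
text = ElementTree._escape_attrib(text)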
Text in source xml:
Here is a test message&#10;With newline &amp; ampersand
Text after "decoding":
Here is a test message
With newline & ampersand
Text after "escaping back":
Here is a test message&#10;With newline &amp; ampersand
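If relying on a private function is a concern, roughly the same round trip can be sketched with the public xml.sax.saxutils functions, which accept a dictionary of extra replacements. This is only a minimal sketch, not the approach used above: the names ATTR_ESCAPES, ATTR_UNESCAPES, escape_attr_text and unescape_attr_text are made up for illustration, and it only handles newlines and double quotes on top of the default &, < and >.

```python
from xml.sax.saxutils import escape, unescape

# Hypothetical names: extra replacements applied after the default
# handling of &, < and > that escape() always performs.
ATTR_ESCAPES = {"\n": "&#10;", '"': "&quot;"}
ATTR_UNESCAPES = {"&#10;": "\n", "&quot;": '"'}


def escape_attr_text(text):
    """Escape text for use in an attribute value, keeping newlines as &#10;."""
    return escape(text, ATTR_ESCAPES)


def unescape_attr_text(text):
    """Reverse escape_attr_text."""
    return unescape(text, ATTR_UNESCAPES)


decoded = "Here is a test message\nWith newline & ampersand"
escaped = escape_attr_text(decoded)
print(escaped)  # Here is a test message&#10;With newline &amp; ampersand
```

Carriage returns and tabs are left alone here; they could be added to the two dictionaries in the same way if the files contain them.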


&#10; is the XML character reference for character 0x0a, the newline. The parser is correctly parsing the XML and giving you the characters indicated. If you want to forbid or otherwise deal with newlines in attributes, you are free to do whatever you like with them after the parser gives them to you.
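To see this with the standard library, here is a small sketch using xml.etree.ElementTree (an assumption on my part, since the question used minidom) on the Entry element from the question, self-closed so that it is well-formed on its own. The parsed attribute holds a real newline character, and ElementTree escapes it again when serializing. The replace at the end is just one example of dealing with the newline after parsing.

```python
import xml.etree.ElementTree as ET

# The Entry element from the question, self-closed so it parses on its own.
source = '<Entry text="For English For Hearing Impaired&#10;Press 3 on Keypad"/>'
entry = ET.fromstring(source)

# The parser resolves the character reference to an actual newline character.
value = entry.get("text")
print(repr(value))

# On serialization, ElementTree escapes the newline in the attribute again.
round_tripped = ET.tostring(entry, encoding="unicode")
print(round_tripped)

# One way to deal with the newline after parsing: swap it for a space.
entry.set("text", value.replace("\n", " "))
flattened = ET.tostring(entry, encoding="unicode")
print(flattened)
```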