Escaping strings for use in XML

2022-12-08 08:57 问答作者：

I'm using Py开发者_StackOverflow社区thon's xml.dom.minidom to create an XML document. (Logical structure -> XML string, not the other way around.)

How do I make it escape the strings I provide so they won't be able to mess up the XML?

Something like this?

>>> from xml.sax.saxutils import escape
>>> escape("< & >")   
'&lt; &amp; &gt;'

xml.sax.saxutils does not escape quotation characters (")

So here is another one:

def escape( str_xml: str ):
    str_xml = str_xml.replace("&", "&amp;")
    str_xml = str_xml.replace("<", "&lt;")
    str_xml = str_xml.replace(">", "&gt;")
    str_xml = str_xml.replace("\"", "&quot;")
    str_xml = str_xml.replace("'", "&apos;")
    return str_xml

if you look it up then xml.sax.saxutils only does string replace

xml.sax.saxutils.escape only escapes &, <, and > by default, but it does provide an entities parameter to additionally escape other strings:

from xml.sax.saxutils import escape

def xmlescape(data):
    return escape(data, entities={
        "'": "&apos;",
        "\"": "&quot;"
    })

xml.sax.saxutils.escape uses str.replace() internally, so you can also skip the import and write your own function, as shown in MichealMoser's answer.

Do you mean you do something like this:

from xml.dom.minidom import Text, Element

t = Text()
e = Element('p')

t.data = '<bar><a/><baz spam="eggs"> & blabla &entity;</>'
e.appendChild(t)

Then you will get nicely escaped XML string:

>>> e.toxml()
'<p>&lt;bar&gt;&lt;a/&gt;&lt;baz spam=&quot;eggs&quot;&gt; &amp; blabla &amp;entity;&lt;/&gt;</p>'

The accepted answer from Andrey Vlasovskikh is the most complete answer to the OP. But this topic comes up for most frequent searches for python escape xml and I wanted to offer a time comparison of the three solutions discussed in this article along with offering a fourth option we choose to deploy due to the enhanced performance it offered.

All four rely on either native python data handling, or python standard library. The solutions are offered in order from the slowest to the fastest performance.

Option 1 - regex

This solution uses the python regex library. It yields the slowest performance:

import re
table = {
    "<": "&lt;",
    ">": "&gt;",
    "&": "&amp;",
    "'": "&apos;",
    '"': "&quot;",
}
pat = re.compile("({})".format("|".join(table)))

def xmlesc(txt):
    return pat.sub(lambda match: table[match.group(0)], txt)

>>> %timeit xmlesc('<&>"\'')
1.48 µs ± 1.73 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

FYI: µs is the symbol for microseconds, which is 1-millionth of second. The other implementation's completion times are measured in nanoseconds (ns) which is billionths of a second.

Option 2 -- xml.sax.saxutils

This solution uses python xml.sax.saxutils library.

from xml.sax.saxutils import escape
def xmlesc(txt):
    return escape(txt, entities={"'": "&apos;", '"': "&quot;"})

>>> %timeit xmlesc('<&>"\'')
832 ns ± 4.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Option 3 - str.replace

This solution uses the string replace() method. Under the hood, it implements similar logic as python's xml.sax.saxutils. The saxutils code has a for loop that costs some performance, making this version slightly faster.

def xmlesc(txt):
    txt = txt.replace("&", "&amp;")
    txt = txt.replace("<", "&lt;")
    txt = txt.replace(">", "&gt;")
    txt = txt.replace('"', "&quot;")
    txt = txt.replace("'", "&apos;")
    return txt

>>> %timeit xmlesc('<&>"\'')
503 ns ± 0.725 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Option 4 - str.translate

This is the fastest implementation. It uses the string translate() method.

table = str.maketrans({
    "<": "&lt;",
    ">": "&gt;",
    "&": "&amp;",
    "'": "&apos;",
    '"': "&quot;",
})
def xmlesc(txt):
    return txt.translate(table)

>>> %timeit xmlesc('<&>"\'')
352 ns ± 0.177 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

If you don't want another project import and you already have cgi, you could use this:

>>> import cgi
>>> cgi.escape("< & >")
'&lt; &amp; &gt;'

Note however that with this code legibility suffers - you should probably put it in a function to better describe your intention: (and write unit tests for it while you are at it ;)

def xml_escape(s):
    return cgi.escape(s) # escapes "<", ">" and "&"

xml_special_chars = {
    "<": "&lt;",
    ">": "&gt;",
    "&": "&amp;",
    "'": "&apos;",
    '"': "&quot;",
}

xml_special_chars_re = re.compile("({})".format("|".join(xml_special_chars)))

def escape_xml_special_chars(unescaped):
    return xml_special_chars_re.sub(lambda match: xml_special_chars[match.group(0)], unescaped)

All the magic happens in re.sub(): argument repl accepts not only strings, but also functions.

继续阅读：escaping python security xml

Escaping strings for use in XML

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？