Escape unescaped characters in XML with Python

2023-02-09 20:12 问答作者：

I need to escape special characters in an invalid XML file which is about 5000 lines long. Here's an example of the XML that I have to deal with:

<root>
 <el开发者_运维百科ement>
  <name>name & surname</name>
  <mail>name@name.org</mail>
 </element>
</root>

Here the problem is the character "&" in the name. How would you escape special characters like this with a Python library? I didn't find a way to do it with BeautifulSoup.

If you don't care about invalid characters in the xml you could use XML parser's recover option (see Parsing broken XML with lxml.etree.iterparse):

from lxml import etree

parser = etree.XMLParser(recover=True) # recover from bad characters.
root = etree.fromstring(broken_xml, parser=parser)
print etree.tostring(root)

Output

<root>
<element>
<name>name  surname</name>
<mail>name@name.org</mail>
</element>
</root>

You're probably just wanting to do some simple regexp-ery on the HTML before throwing it into BeautifulSoup.

Even simpler, if there aren't any SGML entities (&...;) in the code, html=html.replace('&','&') will do the trick.

Otherwise, try this:

x ="<html><h1>Fish & Chips & Gravy</h1><p>Fish &amp; Chips &#x0026; Gravy</p>"
import re
q=re.sub(r'&([^a-zA-Z#])',r'&amp;\1',x)
print q

Essentially the regex looks for & not followed by alpha-numeric or # characters. It won't deal with ampersands at the end of lines, but that's probably fixable.

<name>name & surname</name>

is not well-formed XML. It should be:

<name>name &amp; surname</name>

All conformant XML tools should create this - you normally do not have to worry. If you create a string with the '&' character then an XML tool will output the escaped version. If you create the string by hand it is your responsibility to make sure it is escaped. If you use an XML editor it should escape it for you.

If the file has been given you by someone else, send it back and tell them it is not well-formed. If they no longer exist you will have to use a plain text editor. That's fragile and messy but there is no other way. If the file has ampersands elsewhere that are used for escaping then the file is garbage.

See a 10-year-old post here and a later one here.

This answer provides XML sanitizer functions, although they don't escape the unescaped characters, but simply drop them instead.

Using bs4 with lxml

The question wondered how to do it with Beautiful Soup. Here is a function which will sanitize a small XML bytes object with it. It was tested with the package requirements beautifulsoup4==4.8.0 and lxml==4.4.0. Note that lxml is required here by bs4.

import xml.etree.ElementTree

import bs4


def sanitize_xml(content: bytes) -> bytes:
    # Ref: https://stackoverflow.com/a/57450722/
    try:
        xml.etree.ElementTree.fromstring(content)
    except xml.etree.ElementTree.ParseError:
        return bs4.BeautifulSoup(content, features='lxml-xml').encode()
    return content  # already valid XML

Using only lxml

Obviously there is not much of a point in using both bs4 and lxml when this can be done with lxml alone. This lxml==4.4.0 using sanitizer function is essentially derived from the answer by jfs.

import lxml.etree


def sanitize_xml(content: bytes) -> bytes:
    # Ref: https://stackoverflow.com/a/57450722/
    try:
        lxml.etree.fromstring(content)
    except lxml.etree.XMLSyntaxError:
        root = lxml.etree.fromstring(content, parser=lxml.etree.XMLParser(recover=True))
        return lxml.etree.tostring(root)
    return content  # already valid XML

继续阅读：lxml python special-characters xml

Escape unescaped characters in XML with Python

Output

Using bs4 with lxml

Using only lxml

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

Output

Using bs4 with lxml

Using only lxml

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？