Python ElementTree support for parsing unknown XML entities?

2023-03-31 09:14 问答作者：

I have a set of super simple XML files to parse... but... they use custom defined entities. I don't need to map these to characters, but I do wish to parse and act on each one. For example:

<Style name="admin-5678">
    <Rule>
      <Filter>[admin_level]='5'</Filter>
      &maxscale_zoom11;
    </Rule>
</Style>

There is a tantalizing hint at http://effbot.org/elementtree/elementtree-xmlparser.htm that XMLParser has limited entity support, but I can't find the methods mentioned, everything gives errors:

    #!/usr/bin/python
    ##
    ## Where's the entity support as documented at:
    ## http://effbot.org/elementtree/elementtree-xmlparser.htm
    ## In Python 2.7.1+ ?
    ##
    from pprint     import pprint
    from xml.etree  import ElementTree
    from cStringIO  import StringIO

    parser = ElementTree.ElementTree()
   #parser.entity["maxscale_zoom11"] = unichr(160)
    testf = StringIO('<foo>&maxscale_zoom11;</foo>')
    tree = parser.parse(testf)
   #tree = parser.parse(testf,"XMLParser")
    for node in tree.iter('foo'):
        print node.text

Which depending on how you adjust the comments gives:

xml.etree.ElementTree.ParseError: undefined entity: line 1, column 5

AttributeError开发者_开发问答: 'ElementTree' object has no attribute 'entity'

AttributeError: 'str' object has no attribute 'feed'

For those curious the XML is from the OpenStreetMap's mapnik project.

As @cnelson already pointed out in a comment, the chosen solution here won't work in Python 3.

I finally got it working. Quoted from this Q&A.

Inspired by this post, we can just prepend some XML definition to the incoming raw HTML content, and then ElementTree would work out of box.

This works for both Python 2.6, 2.7, 3.3, 3.4.

import xml.etree.ElementTree as ET

html = '''<html>
    <div>Some reasonably well-formed HTML content.</div>
    <form action="login">
    <input name="foo" value="bar"/>
    <input name="username"/><input name="password"/>

    <div>It is not unusual to see &nbsp; in an HTML page.</div>

    </form></html>'''

magic = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
            <!ENTITY nbsp ' '>
            ]>'''  # You can define more entities here, if needed

et = ET.fromstring(magic + html)

I'm not sure if this is a bug in ElementTree or what, but you need to call UseForeignDTD(True) on the expat parser to behave the way it did in the past.

It's a bit hacky, but you can do this by creating your own instance of ElementTree.Parser, calling the method on it's instance of xml.parsers.expat, and then passing it to ElementTree.parse():

from xml.etree  import ElementTree
from cStringIO  import StringIO


testf = StringIO('<foo>&moo_1;</foo>')

parser = ElementTree.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity['moo_1'] = 'MOOOOO'

etree = ElementTree.ElementTree()

tree = etree.parse(testf, parser=parser)

for node in tree.iter('foo'):
    print node.text

This outputs "MOOOOO"

Or using a mapping interface:

from xml.etree  import ElementTree
from cStringIO  import StringIO

class AllEntities:
    def __getitem__(self, key):
        #key is your entity, you can do whatever you want with it here
        return key

testf = StringIO('<foo>&moo_1;</foo>')

parser = ElementTree.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity = AllEntities()

etree = ElementTree.ElementTree()

tree = etree.parse(testf, parser=parser)

for node in tree.iter('foo'):
    print node.text

This outputs "moo_1"

A more complex fix would be to subclass ElementTree.XMLParser and fix it there.

继续阅读：openstreetmap python xml

Python ElementTree support for parsing unknown XML entities?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？