python ElementTree write function

2023-04-03 19:44 Asked by:

I am using python ElementTree to read and modify some content of my html files. When I am done with changes and use ElementTree.write function,

1) It adds an extra html: prefix in front of all the tags. How should I avoid that?

2) It also adds & where I have special characters. How should I avoid that?

Thank you, Divya.

You can't. ElementTree works by loading the XML, parsing it, and storing only an abstract representation. It writes that back out by walking the abstract representation, but it doesn't remember things like which characters were escaped as entities, or whether an empty element was written as <foo/> or <foo></foo> (in HTML: <foo> or <foo></foo>).
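To see the effect, here is a small round trip through the standard library's xml.etree.ElementTree. The sample document is made up for illustration, and this is only a sketch of the behaviour described above, not a complete catalogue of what changes:

```python
import xml.etree.ElementTree as ET

# Made-up input: an explicitly closed empty element and a numeric
# character reference that ElementTree will not remember.
source = "<root><a></a><b>&#65; &amp; B</b></root>"

root = ET.fromstring(source)
result = ET.tostring(root, encoding="unicode")

# The parsed text is the same, but the serialized form differs:
# the empty element is written in its short form, the character
# reference comes back as a plain character, and the bare ampersand
# is escaped again because the output has to stay well-formed.
print(source)
print(result)
```

The tree holds only the decoded text ("A & B"), so the writer has no way of knowing how it was spelled in the input.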

Now, since ElementTree only works with XML (not HTML), I'm guessing you're working with lxml.html -- in this case, it in fact automatically corrects certain forms of erroneous HTML, because otherwise it wouldn't be able to store it correctly.
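If the files are actually well-formed XHTML and are being parsed by ElementTree itself, the html: prefix from the question may have a different cause: elements in the XHTML namespace get serialized with ElementTree's built-in html prefix for that namespace. A minimal sketch, assuming that is the situation (the sample document is made up), registers the namespace as the default one before writing:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Made-up XHTML input that declares the namespace as the default one.
source = '<html xmlns="%s"><body><p>hi</p></body></html>' % XHTML_NS
root = ET.fromstring(source)

# By default the XHTML namespace is written with an "html" prefix.
before = ET.tostring(root, encoding="unicode")

# Map the namespace to the empty prefix so it is written as the
# default namespace. Note that this changes module-wide state.
ET.register_namespace("", XHTML_NS)
after = ET.tostring(root, encoding="unicode")

print(before)
print(after)
```

register_namespace is available from Python 2.7 and 3.2 on. It only affects the prefix; escaping of special characters still happens as described above.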

The right way to handle HTML whose data you want to be completely preserved except for the parts you alter is to read it as tokens that remember their original representation. I've done this using sgmllib (a Python 2 module; it was removed in Python 3), but this is imperfect -- e.g. there's a get_starttag_text method for getting the exact text of a start tag, but no corresponding method for end tags. It might be good enough anyway.

For example, to write out HTML where all the bold (<b>) tags are removed but their contents are kept, one might write the function like this:

import sgmllib  # Python 2 only
from cStringIO import StringIO

class SGMLModifier(sgmllib.SGMLParser):
    def __init__(self, *args, **kwargs):
        sgmllib.SGMLParser.__init__(self, *args, **kwargs)
        self._file = StringIO()

    def getvalue(self):
        return self._file.getvalue()

    def start_b(self, attributes):
        # skip it
        pass

    def end_b(self):
        # skip it
        pass

    def unknown_starttag(self, tag, attributes):
        self._file.write(self.get_starttag_text())

    def unknown_endtag(self, tag):
        # we can't get this verbatim.
        self._file.write('</%s>' % tag)

    def handle_comment(self, comment):
        # we only get the text between <!-- and -->, so re-wrap it
        # without adding extra spaces.
        self._file.write('<!--%s-->' % comment)

    def handle_data(self, data):
        self._file.write(data)

    def convert_entityref(self, ref):
        return '&' + ref + ';'

def remove_bold(html):
    parser = SGMLModifier()
    parser.feed(html)
    parser.close()  # flush any input the parser is still buffering
    return parser.getvalue()

This might need a bit more work to avoid mangling the input; end tags and comments, for instance, are rebuilt rather than copied verbatim. Check the sgmllib documentation for the details.
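Since sgmllib is not available in Python 3, the same idea can be sketched there with html.parser.HTMLParser, which also provides get_starttag_text. This is a rough equivalent rather than a drop-in port: the class name BoldStripper is made up here, and end tags are still rebuilt rather than copied verbatim.

```python
from html.parser import HTMLParser
from io import StringIO


class BoldStripper(HTMLParser):
    """Re-emit the input as it is parsed, dropping <b> and </b> tags."""

    def __init__(self):
        # convert_charrefs=False keeps &amp; and &#169; as separate
        # events so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self._out = StringIO()

    def getvalue(self):
        return self._out.getvalue()

    def handle_starttag(self, tag, attrs):
        if tag != "b":
            # exact source text of the start tag
            self._out.write(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # self-closing form such as <br/>
        if tag != "b":
            self._out.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "b":
            # not verbatim: case and whitespace are normalized
            self._out.write("</%s>" % tag)

    def handle_comment(self, data):
        self._out.write("<!--%s-->" % data)

    def handle_decl(self, decl):
        self._out.write("<!%s>" % decl)

    def handle_data(self, data):
        self._out.write(data)

    def handle_entityref(self, name):
        self._out.write("&%s;" % name)

    def handle_charref(self, name):
        self._out.write("&#%s;" % name)


def remove_bold(html):
    parser = BoldStripper()
    parser.feed(html)
    parser.close()  # flush any buffered input
    return parser.getvalue()


sample = '<p class="x">a &amp; <b>bold</b> &#169;<br/><!-- c --></p>'
cleaned = remove_bold(sample)
print(cleaned)
```

Everything except the two bold tags passes through as it appeared in the sample, including the entity and character references that ElementTree would have rewritten.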
