Fixing tostring() in Python's lxml

2023-02-02 11:23

lxml's tostring() function seems quite broken when printing only parts of documents. Witness:

from lxml.html import fragment_fromstring, tostring
frag = fragment_fromstring('<p>This stuff is <em>really</em> great!')
em = frag.cssselect('em').pop(0)
print(tostring(em))

I expect <em>really</em> but instead it prints <em>really</em> great! which is wrong. The ' great!' is not part of the selected em. It's not only wrong, it's a pill, at least for processing document-structured XML, where such trailing text will be common.

As I understand it, lxml stores any free text that comes after the current element in the element's .tail attribute. A scan of the code for tostring() brings me to ElementTree.py's _write() function, which clearly always prints the tail. That's correct behavior for whole trees, but not on the last element when rendering a subtree, yet it makes no distinction.
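To make the .text/.tail split concrete, here is a minimal sketch using lxml.etree on the same markup. The closing </p> is added because etree.fromstring expects well-formed XML, and the variable names are only for illustration.

```python
from lxml import etree

# Same markup as the example above, closed explicitly for the XML parser.
p = etree.fromstring('<p>This stuff is <em>really</em> great!</p>')
em = p.find('em')

p_text = p.text    # text before the first child of <p>
em_text = em.text  # text inside <em>
em_tail = em.tail  # text after </em>, stored on the <em> element itself

print(repr(p_text), repr(em_text), repr(em_tail))
```

The trailing ' great!' lives on em rather than on p, which is why serializing em alone drags it along.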

To get a proper tail-free rendering of the selected XML, I tried writing a toxml() function from scratch to use in its place. It basically worked, but there are many special cases in handling comments, processing instructions, namespaces, encodings, yadda yadda. So I changed gears and now just piggyback tostring(), post-processing its output to remove the offending .tail text:

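def toxml(e):
    """Replacement for lxml's tostring() that doesn't add spurious
    tail text."""

    from lxml.etree import tostring
    xml = tostring(e)
    if e.tail:
        # Strip the tail from the end of the output. This assumes the tail
        # serializes to exactly len(e.tail) characters, which fails if it
        # contains escaped characters such as '&' or non-ASCII text.
        xml = xml[:-len(e.tail)]
    return xml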

A basic series of tests shows this works nicely.

Critiques and/or suggestions?

How about xml = lxml.etree.tostring(e, with_tail=False)?

from lxml.html import fragment_fromstring
from lxml.etree import tostring
frag = fragment_fromstring('<p>This stuff is <em>really</em> great!')
em = frag.cssselect('em').pop(0)
print(tostring(em, with_tail=False))

Looks like with_tail was added in v2.0; do you have an older version?
