In Python, how do I remove the "root" tag in an HTML snippet?

Asked 2023-01-02 15:38

Suppose I have an HTML snippet like this:

<div>
  Hello <strong>There</strong>
  <div>I think <em>I am</em> feeing better!</div>
  <div>Don't you?</div>
  Yup!
</div>

What's the best/most robust way to remove the surrounding root element, so it looks like this:

Hello <strong>There</strong>
<div>I think <em>I am</em> feeing better!</div>
<div>Don't you?</div>
Yup!

I've tried using lxml.html like this:

lxml.html.fromstring(fragment_string).drop_tag()

But that only gives me "Hello", which I guess makes sense. Any better ideas?

This is a bit odd in lxml (or ElementTree). You'd have to do:

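from lxml.html import tostring

def inner_html(el):
    # el.text is the text before the first child; each serialized child carries its own tail text
    return (el.text or '') + ''.join(tostring(child, encoding='unicode') for child in el)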

Note that lxml (and ElementTree) have no special way to represent a document except rooted with a single element, but .drop_tag() would work like you want if that <div> wasn't the root element.
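Since the same idea applies to ElementTree, here is a minimal standard-library sketch of it. It assumes the fragment is well-formed XML (the parser used here rejects sloppy real-world HTML), and the name inner_xml is only illustrative:

```python
import xml.etree.ElementTree as ET

def inner_xml(el):
    """Serialize everything inside el, leaving out el's own start and end tags."""
    # el.text is the text before the first child; ET.tostring() appends each
    # child's tail text, so the text between and after children is preserved.
    return (el.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in el
    )

fragment = "<div>Hello <strong>There</strong> <div>Don't you?</div> Yup!</div>"
result = inner_xml(ET.fromstring(fragment))
print(result)
```

For HTML that is not well-formed, lxml.html is likely the safer parser to pair with this approach.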

You can use the BeautifulSoup package. For this particular HTML I would do it like this:

from bs4 import BeautifulSoup

html = """<div>
  Hello <strong>There</strong>
  <div>I think <em>I am</em> feeing better!</div>
  <div>Don't you?</div>
  Yup!
</div>"""

bs = BeautifulSoup(html, 'html.parser')

# The text nodes already contain the original newlines, so join without a separator
no_root = ''.join(str(node) for node in bs.div.contents)

BeautifulSoup has many nice features that will allow you to tweak this example for many other cases. Full documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html.

For such a simple task you can use a regexp like r'<(.*?)>(.*)</\1>' and take group 2 (\2 in Perl terms) from the match.

You should also pass flags such as re.M | re.S (at minimum re.S, so that . matches newlines) for it to work across multiple lines.
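A minimal sketch of this regexp approach, using only the standard re module. The helper name strip_root is hypothetical. Note the assumption that the root tag has no attributes: the backreference \1 captures everything between < and >, so a root like <div class="x"> is not expected to be handled.

```python
import re

# Group 1 is the root tag's contents (assumed to be a bare tag name),
# group 2 is everything up to the last matching closing tag.
ROOT_RE = re.compile(r"<(.*?)>(.*)</\1>", re.DOTALL)

def strip_root(fragment):
    """Return the text inside the outermost tag, or the input unchanged if nothing matches."""
    match = ROOT_RE.search(fragment)
    return match.group(2) if match else fragment

sample = "<div>Hello <strong>There</strong>\n<div>Don't you?</div>\nYup!</div>"
print(strip_root(sample))
```

The greedy (.*) is what makes the pattern stop at the last closing tag rather than the first, so nested elements with the same name as the root stay intact.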
