开发者

how to strip all child tags in an xml tag but leaving the text to merge to the parens using lxml in python?

How can one tell etree.strip_tags() to strip all possible tags from a given tag element?

Do I have to map them myself, like:

STRIP_TAGS = [ALL TAGS...] # Is there a built in list or dictionary in lxml
                           # that gives you all tags?
etree.strip_tags(tag, *STRIP_TAGS)

Perhaps a more elegant approach I don't know of?

Example input:

parent_tag 开发者_StackOverflow中文版= "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"

Desired Output:

# <parent>This is some text with multiple tags and sometimes they are nested.</parent>

or even better:

This is some text with multiple tags and sometimes they are nested.


You can use the lxml.html.clean module:

import lxml.html, lxml.html.clean


s = '<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'

tree = lxml.html.fromstring(s)
cleaner = lxml.html.clean.Cleaner(allow_tags=['parent'], remove_unknown_tags=False)
cleaned_tree = cleaner.clean_html(tree)

print lxml.etree.tostring(cleaned_tree)
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>


This answer is a bit late, but I guess a simpler solution than the one provided by the initial answer by ars might be handy for safekeeping's sake.

Short Answer

Use the "*" argument when you call strip_tags() to specify all tags to be stripped.

Long Answer

Given your XML string, we can create an lxml Element:

>>> import lxml.etree
>>> s = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
>>> parent_tag = lxml.etree.fromstring(s)

You can inspect that instance like so:

>>> parent_tag
<Element parent at 0x5f9b70>
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'

To strip out all the tags except the parent tag itself, use the etree.strip_tags() function like you suggested, but with a "*" argument:

>>> lxml.etree.strip_tags(parent_tag, "*")

Inspection shows that all child tags are gone:

>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some text with multiple tags and sometimes they are nested.</parent>'

Which is your desired output. Note that this will modify the lxml Element instance itself! To make it even better (as you asked :-)) just grab the text property:

>>> parent_tag.text
'This is some text with multiple tags and sometimes they are nested.'
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜