python remove text inside <p>

2023-02-25 21:46 Q&A author:

I want to remove the text inside the opening <p> tags of a block of HTML. I am trying to standardize some text by removing all class, align, and other attributes. Every example I can find deals with stripping HTML entirely, and I don't want to strip the tags. I just want to make them all plain.

So if I have something like this:

<p class='MsoBodyText' align='left'>
some paragraph blah blah blah
</p>

<p class='SomeClassIDontWant' align='right'>
some other paragraph blah blah blah
</p>

I want to return:

<p>
some paragraph blah blah blah
</p>

<p>
some other paragraph blah blah blah
</p>

Use a library for parsing HTML such as Beautiful Soup or a similar alternative. Regex is not powerful enough to correctly parse HTML.

@Mark made a valid point that in this particular case a simple regex should work because you are not doing full parsing with tag matching etc. I still think it's a good practice to familiarize yourself with these parsing libraries when you find yourself needing more complex operations.
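As one possible "similar alternative", here is a rough sketch that uses Python's standard-library html.parser instead of Beautiful Soup. The class name PlainParagraphs and the helper strip_p_attributes are made up for illustration. It re-emits the input as it parses and drops attributes only from <p> start tags. It is not a complete HTML serializer (end tags come back lowercased, and processing instructions are not re-emitted).

```python
from html.parser import HTMLParser


class PlainParagraphs(HTMLParser):
    """Re-emit the input, dropping attributes from <p> start tags only."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.out.append("<p>")
        else:
            # get_starttag_text() returns the start tag exactly as it appeared
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through unchanged
        self.out.append("<p/>" if tag == "p" else self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_p_attributes(html):
    """Hypothetical helper: return html with every <p ...> reduced to <p>."""
    parser = PlainParagraphs()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(strip_p_attributes("<p class='MsoBodyText' align='left'>\nsome paragraph\n</p>"))
```

Because the parser understands quoted attribute values, a start tag whose attribute contains a > character is still handled as a single tag.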

<p title="1 > 0">Test</p>

I believe this is valid HTML. At the very least Chrome accepts it, and I'm sure other browsers do as well. A pattern that treats the first > as the end of the tag will cut this one short.
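To make the problem with that tag concrete, here is a small sketch using a naive pattern that reads "<p ", then anything up to the first ">". The name NAIVE_P_TAG and the two test strings are only for illustration.

```python
import re

# Naive pattern: "<p ", then anything up to the first ">"
NAIVE_P_TAG = re.compile(r'<p [^>]*?>')

ordinary = "<p class='MsoBodyText' align='left'>Test</p>"
tricky = '<p title="1 > 0">Test</p>'

# Works when no attribute value contains ">"
print(NAIVE_P_TAG.sub('<p>', ordinary))  # <p>Test</p>

# Stops at the ">" inside the quoted title and leaves debris behind
print(NAIVE_P_TAG.sub('<p>', tricky))    # <p> 0">Test</p>
```

So the regex is fine for input you control, but it silently corrupts tags whose attribute values contain a > character.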

Using BeautifulSoup is quite easy: you create a BeautifulSoup object from the string, then for each tag in that object you set its attributes to an empty dict, like this:

from bs4 import BeautifulSoup  # Beautiful Soup 4; "from BeautifulSoup import *" is the old Python 2 package

parsed_html = BeautifulSoup(your_html, "html.parser")
# find_all(True) yields every tag at any depth and skips text nodes
for elem in parsed_html.find_all(True):
    elem.attrs = {}  # attrs is a dict in bs4 (it was a list in BeautifulSoup 3)
print(parsed_html)  # It is clean now

For more information about BeautifulSoup you can see the BeautifulSoup documentation

A regex will miss edge cases, for example a > delimiter inside a quoted attribute value. You should use an HTML parser, the most common one being Beautiful Soup.

Also note that you need to handle both Unicode text and byte strings (unicode and str in Python 2; str and bytes in Python 3).

Here is a solution from me:

from bs4 import BeautifulSoup  # Beautiful Soup 4, Python 3

def clear_p_tags(html_str):
    """ Works for both str and bytes; returns the same type it was given """
    html = BeautifulSoup(html_str, "html.parser")

    # Clears attributes on every tag at any depth (so the <a href> below is
    # cleared too); use find_all("p") to touch only <p> tags.
    for elem in html.find_all(True):
        elem.attrs = {}
    if isinstance(html_str, bytes):
        return html.encode("utf-8")
    return str(html)


def test_p_clear(str_data):

    html_str = str_data
    html_bytes = str_data.encode("utf-8")

    clear_p_html_str = clear_p_tags(html_str)
    clear_p_html_bytes = clear_p_tags(html_bytes)

    print(type(clear_p_html_str))
    print(clear_p_html_str)

    print(type(clear_p_html_bytes))
    print(clear_p_html_bytes)

data = """
<a href="hello.txt"> hello </a>
<p class='MsoBodyText' align='left'>
some paragraph blah blah blah
</p>

<p class='SomeClassIDontWant' align='right'>
some other paragraph blah blah blah
</p>
"""

test_p_clear(data)

I am all for Davy8's answer. You might also look into lxml.html.

If you still want to use regular expressions... you should use something like:

import re
re.sub(r'<p [^>]*?>', r'<p>', foo)
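For reference, here is a runnable sketch of the regex approach applied to the sample from the question. The function name plain_p_tags is hypothetical, and the pattern is a slightly more tolerant variant of the one above, not the original: it accepts any whitespace after "p" and ignores case. It still shares the regex limitation that a > inside a quoted attribute value ends the match early.

```python
import re


def plain_p_tags(html):
    """Replace any <p ...> start tag with a bare <p> (regex-only sketch)."""
    # \s allows a newline or tab after "p"; IGNORECASE also catches <P ...>.
    # Requiring whitespace after "p" leaves <pre ...> and <param ...> alone.
    return re.sub(r'<p\s[^>]*>', '<p>', html, flags=re.IGNORECASE)


sample = """<p class='MsoBodyText' align='left'>
some paragraph blah blah blah
</p>

<p class='SomeClassIDontWant' align='right'>
some other paragraph blah blah blah
</p>"""

print(plain_p_tags(sample))
```

Closing tags and tags that already have no attributes are left untouched, since the pattern only matches "<p" followed by whitespace.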
