python remove text inside <p>

I want to remove the attributes inside the <p> tags in a block of HTML text. I am trying to standardize some text and remove all class, align, and other attributes. Every example I can find deals with stripping the HTML tags themselves, and I don't want to strip the tags. I just want to make them all plain.

So if I have something like this:

<p class='MsoBodyText' align='left'>
some paragraph blah blah blah
</p>

<p class='SomeClassIDontWant' align='right'>
some other paragraph blah blah blah
</p>

I want to return:

<p>
some paragraph blah blah blah
</p>

<p>
some other paragraph blah blah blah
</p>


Use a library for parsing HTML such as Beautiful Soup or a similar alternative. Regex is not powerful enough to correctly parse HTML.

@Mark made a valid point that in this particular case a simple regex should work, because you are not doing full parsing with tag matching and so on. I still think it's good practice to familiarize yourself with these parsing libraries for when you need more complex operations.

That said, even a simple regex can be tripped up by an attribute value that contains a > character:

<p title="1 > 0">Test</p>

I believe this is valid HTML. At the very least Chrome accepts it, and I'm sure other browsers do as well.
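To see the failure concretely, here is a small sketch that applies a naive attribute-stripping pattern to that tag. The pattern and the variable names are illustrative assumptions, not code from this answer.

```python
import re

# Naive pattern: assumes the first '>' after '<p ' closes the tag.
NAIVE_P_TAG = re.compile(r'<p [^>]*?>')

html = '<p title="1 > 0">Test</p>'
result = NAIVE_P_TAG.sub('<p>', html)

# The match stops at the '>' inside the attribute value,
# so the tail of the attribute leaks into the text.
print(result)  # <p> 0">Test</p>
```

The tag is technically well-formed, yet the output is broken markup, which is the kind of case a real parser handles for you.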


Using BeautifulSoup is quite easy. You create a BeautifulSoup object from the string, then set the attribute list of each <p> tag to an empty list, like this:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

parsed_html = BeautifulSoup(your_html)
# findAll walks the whole tree, so nested <p> tags are cleaned too
for elem in parsed_html.findAll('p'):
    elem.attrs = []  # drop class, align, and every other attribute
print parsed_html  # It is clean now

For more information about BeautifulSoup you can see the BeautifulSoup documentation
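If BeautifulSoup is not available, roughly the same result can be sketched with Python 3's standard-library html.parser module. This is a swapped-in technique, not the approach above: the class name PAttributeStripper and the helper strip_p_attributes are hypothetical names chosen for illustration, and the sketch simply re-emits everything except <p> start tags as it was read.

```python
from html.parser import HTMLParser


class PAttributeStripper(HTMLParser):
    """Re-emit the input HTML, dropping attributes from <p> start tags only."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; untouched.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag names; non-<p> tags are copied verbatim.
        self.parts.append('<p>' if tag == 'p' else self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.parts.append('<p/>' if tag == 'p' else self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        self.parts.append('&#%s;' % name)

    def handle_comment(self, data):
        self.parts.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.parts.append('<!%s>' % decl)


def strip_p_attributes(html):
    """Return html with every <p ...> start tag reduced to a bare <p>."""
    parser = PAttributeStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)
```

Calling strip_p_attributes on the question's sample should leave the paragraph text and any other tags (for example an <a href=...> link) as they were, while the <p> tags come back plain.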


A regex can fail when an attribute value contains a delimiter such as >. You should use an HTML parser instead, the most common one being Beautiful Soup.

Also note that you need to handle unicode input as well as plain str.

Here is my solution:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

def clear_p_tags(html_str):
    """ Works for both unicode and str input """
    html = BeautifulSoup(html_str)

    # Strip attributes from every <p> tag; other tags are left alone
    for elem in html.findAll('p'):
        elem.attrs = []
    # Convert back to the caller's type (str or unicode)
    return type(html_str)(html)


def test_p_clear(str_data):

    html_str = str_data
    html_unicode = unicode(str_data)

    clear_p_html_str = clear_p_tags(html_str)
    clear_p_html_unicode = clear_p_tags(html_unicode)

    print type(clear_p_html_str)
    print clear_p_html_str

    print type(clear_p_html_unicode)
    print clear_p_html_unicode

data = """
<a href="hello.txt"> hello </a>
<p class='MsoBodyText' align='left'>
some paragraph blah blah blah
</p>

<p class='SomeClassIDontWant' align='right'>
some other paragraph blah blah blah
</p>
"""

test_p_clear(data)


I am all for Davy8's answer. You might also look into lxml.html.

If you still want to use regular expressions... you should use something like:

re.sub(r'<p [^>]*?>', r'<p>', foo)
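The pattern above stops at the first > it sees, including one inside a quoted attribute value. A slightly more tolerant variant, offered only as a sketch, skips over complete quoted values before looking for the closing >. The names P_OPEN_TAG and strip_p_attrs are hypothetical, and this still does not cover every HTML edge case (comments, uppercase <P>, unbalanced quotes), so a parser remains the safer choice.

```python
import re

# '<p' + whitespace, then any mix of unquoted characters and
# complete single- or double-quoted attribute values, then '>'.
P_OPEN_TAG = re.compile(r'''<p\s(?:[^>"']|"[^"]*"|'[^']*')*>''')


def strip_p_attrs(html):
    """Replace every attribute-bearing <p ...> start tag with a bare <p>."""
    return P_OPEN_TAG.sub('<p>', html)


print(strip_p_attrs('<p title="1 > 0">Test</p>'))  # <p>Test</p>
```

Requiring whitespace right after <p keeps the pattern from touching tags such as <pre>, and a bare <p> is simply left alone.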