python remove text inside <p>
I want to remove the text inside the <p> tags themselves (the attributes) for a block of HTML text. I am trying to standardize some text and remove all class, align, and other information. Every example I can find deals with stripping HTML entirely, and I don't want to strip the tags. I just want to make them all plain.
So if I have something like this:
<p class='MsoBodyText' align='left'>
some paragraph blah blah blah
</p>
<p class='SomeClassIDontWant' align='right'>
some other paragraph blah blah blah
</p>
I want to return:
<p>
some paragraph blah blah blah
</p>
<p>
some other paragraph blah blah blah
</p>
Use a library for parsing HTML such as Beautiful Soup or a similar alternative. Regex is not powerful enough to correctly parse HTML.
@Mark made a valid point that in this particular case a simple regex should work because you are not doing full parsing with tag matching etc. I still think it's a good practice to familiarize yourself with these parsing libraries when you find yourself needing more complex operations.
<p title="1 > 0">Test</p>
I believe this is valid HTML. At the very least Chrome accepts it, and I'm sure other browsers do as well. A regex that stops at the first > would cut the tag off in the middle of the attribute value.
Using BeautifulSoup is quite easy: you create a BeautifulSoup object from the string, and then for each element in that object you set the attribute list to an empty list, like this:
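# Python 2 with BeautifulSoup 3 (the old "BeautifulSoup" package)
from BeautifulSoup import BeautifulSoup, NavigableString

parsed_html = BeautifulSoup(your_html)
# Iterating the soup visits its direct children only
for elem in parsed_html:
    if not isinstance(elem, NavigableString):  # It must be a node, not text
        elem.attrs = []
print parsed_html  # It is clean now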
For more information about BeautifulSoup you can see the BeautifulSoup documentation
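For readers on Python 3, the same idea can be sketched with the newer bs4 package. This is only an illustration, not the answer's original code: it assumes bs4 is installed, uses the standard library's "html.parser" backend, and the function name strip_p_attributes is made up for the example. In bs4, a tag's attributes are a dict, so they are reset to an empty dict rather than an empty list.

```python
from bs4 import BeautifulSoup


def strip_p_attributes(html_text):
    """Return html_text with all attributes removed from every <p> tag."""
    soup = BeautifulSoup(html_text, "html.parser")
    # find_all searches the whole tree, so nested <p> tags are covered too
    for p in soup.find_all("p"):
        p.attrs = {}
    return str(soup)


sample = """<p class='MsoBodyText' align='left'>
some paragraph blah blah blah
</p>
<p title="1 > 0">Test</p>"""

cleaned = strip_p_attributes(sample)
print(cleaned)
```

Because only <p> tags are selected, attributes on other elements, such as the href of a link, are left alone.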
A regex will fail on edge cases such as a delimiter like > appearing inside an attribute value. You should use an HTML parser, the most common one being Beautiful Soup.
Also note that you need to handle Unicode as well as simple str.
Here is a solution from me:
from BeautifulSoup import BeautifulSoup, Tag

def clear_p_tags(html_str):
    """ Works well both for unicode as well as str """
    html = BeautifulSoup(html_str)
    # Note: this clears attributes on every top-level tag, not only <p>
    for elem in html:
        if type(elem) is Tag:
            elem.attrs = []
    # Return the same type (str or unicode) that was passed in
    return type(html_str)(html)

def test_p_clear(str_data):
    html_str = str_data
    html_unicode = unicode(str_data)
    clear_p_html_str = clear_p_tags(html_str)
    clear_p_html_unicode = clear_p_tags(html_unicode)
    print type(clear_p_html_str)
    print clear_p_html_str
    print type(clear_p_html_unicode)
    print clear_p_html_unicode

data = """
<a href="hello.txt"> hello </a>
<p class='MsoBodyText' align='left'>
some paragraph blah blah blah
</p>
<p class='SomeClassIDontWant' align='right'>
some other paragraph blah blah blah
</p>
"""
test_p_clear(data)
I am all for Davy8's answer. You might also look into lxml.html.
If you still want to use regular expressions... you should use something like:
re.sub(r'<p [^>]*?>', r'<p>', foo)
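As a rough check of where that pattern works and where it breaks, the sketch below runs it on one of the question's tags and on the title="1 > 0" example mentioned earlier in the thread. The variable names are chosen just for this illustration.

```python
import re

# Same pattern as above: "<p ", then anything up to the first ">"
pattern = re.compile(r'<p [^>]*?>')

simple = "<p class='MsoBodyText' align='left'>text</p>"
tricky = '<p title="1 > 0">Test</p>'

simple_result = pattern.sub('<p>', simple)
tricky_result = pattern.sub('<p>', tricky)

print(simple_result)  # <p>text</p>
print(tricky_result)  # <p> 0">Test</p>
```

The second result shows the failure mode: the match ends at the > inside the attribute value, so the rest of the attribute leaks into the paragraph text.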