Match "without this"
I need to rem开发者_JS百科ove all <p></p>
that are only <p>
's in <td>
.
import re
text = """
<td><p>111</p></td>
<td><p>111</p><p>222</p></td>
"""
text = re.sub(r'<td><p>(??no</p>inside??)</p></td>', r'<td>\1</td>', text)
How can I match without</p>inside
?
I would use minidom. I stole the following snippet from here which you should be able to modify and work for you:
from xml.dom import minidom
doc = minidom.parse(myXmlFile)
for element in doc.getElementsByTagName('MyElementName'):
if element.getAttribute('name') in ['AttrName1', 'AttrName2']:
parentNode = element.parentNode
parentNode.insertBefore(doc.createComment(element.toxml()), element)
parentNode.removeChild(element)
f = open(myXmlFile, "w")
f.write(doc.toxml())
f.close()
Thanks @Ivo Bosticky
While using regexps with HTML is bad, matching a string that does not contain a given pattern is an interesting question in itself.
Let's assume that we want to match a string beginning with an a
and ending with a z
and take out whatever is in between only when string bar
is not found inside.
Here's my take: "a((?:(?<!ba)r|[^r])+)z"
It basically says: find a
, then find either an r
which is not preceded by ba
, or something different than r
(repeat at least once), then find a z
. So, a bar
cannot sneak in into the catch group.
Note that this approach uses a 'negative lookbehind' pattern and only works with lookbehind patterns of fixed length (like ba
).
I would definitely recommend using BeautifulSoup for this. It's a python HTML/XML parser.
http://www.crummy.com/software/BeautifulSoup/
Not quite sure why you want to remove the P tags which don't have closing tags. However, if this is an attempt to clean code, an advantage of BeautifulSoup is that is can clean HTML for you:
from BeautifulSoup import BeautifulSoup
html = """
<td><p>111</td>
<td><p>111<p>222</p></td>
"""
soup = BeautifulSoup(html)
print soup.prettify()
this doesn't get rid of your unmatched tags, but it fixes the missing ones.
精彩评论