python regex problem
s = re.sub(r"<style.*?</style>", "", s)
Isn't this code supposed to remove styles from the string s? Why does it not work? I am trying to remove the following code:
<style type="text/css">
body { ... }
</style>
Any suggestion?
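No, what is needed here is the re.DOTALL flag. The style block spans several lines, and by default '.' does not match a newline, so the pattern never reaches </style>. From the docs: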
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
http://docs.python.org/library/re.html#re.DOTALL
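Applied to the question, a minimal sketch could look like this. The sample string and the variable names unchanged and cleaned are made up for illustration. Note that flags is passed as a keyword argument, because the fourth positional argument of re.sub is count, not flags.

```python
import re

# Sample input, made up for this illustration.
s = '''<style type="text/css">
body { ... }
</style>
<p>kept</p>'''

# Without re.DOTALL, '.' stops at the first newline, so nothing matches
# and the string comes back unchanged.
unchanged = re.sub(r"<style.*?</style>", "", s)

# With re.DOTALL, '.' also matches newlines, so the whole block is removed.
cleaned = re.sub(r"<style.*?</style>", "", s, flags=re.DOTALL)

print(cleaned)
```

The newline that followed </style> is left in place, since the pattern only covers the tags and what lies between them.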
Edit
In some cases you may need the dot to match every character (newlines included) in one part of the pattern, and to match only non-newline characters in another part. The re.DOTALL flag cannot do this, because it applies to the whole pattern.
In that case, a useful trick is to write [\s\S] wherever you want to match any character at all:
import re
s = '''alhambra
<style type="text/css">
body { ... }
</style>
toromizuXXXXXXXX
YYYYYYYYYYYYYY'''
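print(s, '\n')
# Raw string so the backslashes in [\s\S] reach the regex engine unchanged.
# [\s\S] matches any character, newline included; '.' still stops at a newline.
regx = re.compile(r"<style[\s\S]*?</style>|(?<=ro)mizu.+")
s = regx.sub('AAA', s)
print(s)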
result
alhambra
<style type="text/css">
body { ... }
</style>
toromizuXXXXXXXX
YYYYYYYYYYYYYY
alhambra
AAA
toroAAA
YYYYYYYYYYYYYY
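As an alternative to [\s\S], and assuming Python 3.6 or later, the DOTALL behaviour can be limited to one part of the pattern with a scoped inline flag, (?s:...). This is only a sketch of the same example rewritten that way; the variable name result is introduced here for illustration.

```python
import re

s = '''alhambra
<style type="text/css">
body { ... }
</style>
toromizuXXXXXXXX
YYYYYYYYYYYYYY'''

# (?s:.*?) turns DOTALL on only inside the group, so the dot there crosses
# newlines, while the trailing '.+' still stops at the end of its line.
regx = re.compile(r"<style(?s:.*?)</style>|(?<=ro)mizu.+")
result = regx.sub('AAA', s)

print(result)
```

On the sample string this should give the same replacement as the [\s\S] version.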