Python re.sub: non-greedy mode (.*?) combined with end of string ($) becomes greedy!
Code:
import re
s = '<br><br />A<br />B'
print(re.sub(r'<br.*?>\w$', '', s))
I expected it to return <br><br />A, but it returns an empty string ''!
Any suggestions?
Non-greediness only works from left to right: it decides where a match stops, not where it starts. It basically means "don't match any more unless the rest of the pattern has failed to match". Here's what's going on:
- The regex engine matches <br at the start of the string. The .*? is skipped for now, because it is lazy.
- It tries to match > and succeeds.
- It tries to match \w and fails, because the next character is <. Now it gets interesting: the engine backtracks and sees the .*? rule. In this case the dot can match the first >, so there's still hope for that match.
- This keeps happening until the regex reaches the slash of the first <br />. Then >\w can match (against >A), but $ fails. Again, the engine goes back to the lazy .*? rule and keeps extending it, until the whole pattern matches <br><br />A<br />B.
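The steps above can be checked directly. This is a minimal sketch, and the variable names (s, lazy, m) are mine rather than the asker's. It shows that the lazy pattern still starts at the first <br and, because of the $, has to run to the end of the string.

```python
import re

s = '<br><br />A<br />B'
lazy = re.compile(r'<br.*?>\w$')

m = lazy.search(s)
# The match starts at the first "<br" (index 0) and covers the whole string.
print(m.span())          # (0, 18)
print(repr(m.group()))   # '<br><br />A<br />B'

# Without "$" the lazy quantifier stops at the first place where ">\w" fits,
# but the match still begins at index 0, not at the last tag.
print(re.search(r'<br.*?>\w', s).span())   # (0, 11)

# Hence re.sub removes everything.
print(repr(lazy.sub('', s)))   # ''
```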
Luckily, there's an easy solution: by using the pattern <br[^>]*>\w$ instead, you don't allow a match to run past the end of a tag, so it only replaces the last occurrence.
Strictly speaking, this doesn't work well for HTML, because tag attributes can contain > characters, but I assume it's just an example.
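Here is a short sketch of that fix applied to the string from the question. The second input (tricky) is a made-up example of mine, added only to illustrate the caveat about > inside an attribute value.

```python
import re

s = '<br><br />A<br />B'

# [^>]* cannot cross a ">", so every attempt stays inside a single tag.
fixed = re.compile(r'<br[^>]*>\w$')
result = fixed.sub('', s)
print(result)   # <br><br />A

# Caveat: a ">" inside an attribute value ends the "tag" too early,
# so the pattern finds nothing to replace here.
tricky = '<br title="a>b">X'
print(fixed.sub('', tricky) == tricky)   # True
```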
The non-greediness won't kick in later in the string like that. The pattern matches at the first <br and then non-greedily matches the rest, which actually needs to run to the end of the string because you specified the $.
To make it work the way you wanted, use
/<br[^<]*?>\w$/
but usually it is not recommended to use regex to parse HTML, as an attribute's value can have < or > in it.
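The pattern above is written with /.../ delimiters, which Python does not have. A rough Python equivalent, assuming a raw string is what was intended, might look like this (the names s and pattern are mine):

```python
import re

s = '<br><br />A<br />B'

# [^<]*? refuses to cross a "<", so a match cannot spill into the next tag.
pattern = re.compile(r'<br[^<]*?>\w$')

result = pattern.sub('', s)
print(result)                    # <br><br />A
print(pattern.search(s).span())  # (11, 18): only the last tag and "B"
```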