regex possessive quantifier against lazy or greedy

2023-01-07 22:50 问答作者：

Can anyone explain me, step by step, why the regex fails with this:

<.++>

with this string to compare: 开发者_如何学运维;

The same string is found with lazy or greedy quantifiers but in this case what steps are involved?

I use Java regex flavor.

From the Java Pattern documentation:

Possessive quantifiers, which greedily match as much as they can and do not back off, even when doing so would allow the overall match to succeed.

In your example, the < in your regex matches < in the string, then .++ matches the entire rest of the string, em>. You still have a > in your regex, but there are no characters left in the string for it to match (because .++ consumed them all). So the match fails.

If the quantifier were greedy, i.e. if it were .+ instead of .++, at this point the regular expression engine would try reducing the portion matched by .+ by one character, to just em, and try again. This time the match would succeed, because there would be a > left in the string for the > in the regex to match.

EDIT: A lazy quantifier would work like a greedy quantifier in reverse. Instead of starting by trying to match the whole rest of the string and backing off character by character, the lazy quantifier would start by trying to match a single character, in this case just e. If that doesn't allow the full regex to match (which it wouldn't here, because you'd have > in the regex trying to match m in the string), the lazy quantifier would move up to matching two characters, em. Then the > in the regex would line up with > in the string and the match would succeed. If it didn't work out, though, the lazy quantifier would move up to three characters, and so on.

Possessive quantifier prevents backtracking - thus .++ part matches the remaining string em>, eating up the last > also.

Hence the last > of the regex has no match and the regex fails.

Like a greedy quantifier, a possessive quantifier will repeat the token as many times as possible. Unlike a greedy quantifier, it will not give up matches as the engine backtracks. With a possessive quantifier, the deal is all or nothing. You can make a quantifier possessive by placing an extra + after it.

On greedy variant

First let's consider how a pattern like <.+> matches against :

The < in the pattern matches the < in the input.
Then .+ matches em> in the input (because it's greedy, it'll first match as many . as possible)
- Then > doesn't match, since there are no more characters in the input
At this point .+ backtracks and must match one less .; so .+ now matches em
Now > in the pattern matches the > in the input.

On reluctant variant

By contrast, this is how <.+?> matches against :

The < in the pattern matches the < in the input.
Then .+? matches e in the input (because it's reluctant, but must take at least one .)
- Then > doesn't match, since the rest of the input is m>
At this point .+ backtracks and must match one more .; so .+? now matches em
Now > in the pattern matches the > in the input.

On negated character class and possessive quantifiers combo

Note that in either of the above cases, .+ or .+? must backtrack for the > to match. This is why <.++> can NEVER match , because here's what happens:

The < in the pattern matches the < in the input
Then .++ matches as many . in the input, and will be in possession of this match
- It will not let go whatever it matched! (hence "possessive")
- In this case, .++ is able to match em>
Now > in the pattern can never match, because any > will be gobbled up by .++
- Since it's possessive, .++ will not "cooperate" by giving back the >

A pattern that at least has a chance to match is <[^>]++>. When matched against :

The < in the pattern matches the < in the input
Then [^>]++ possessively matches as many [^>] in the input (i.e. anything but >)
- In this case it will possessively match em
Now > in the pattern can match the > in the input

As much as is practical, you should refrain from using .*?/.* in your pattern. The . is too flexible since it matches (almost!) any character, and this can cause unnecessary backtracking and/or overmatching.

Whenever applicable, you should use negated character class instead of .

regular-expressions.info

Possessive Quantifier
Character Class
Repetition with Star and Plus (see also:An Alternative to Laziness)
The Dot Matches (Almost) Any Character
[Catastrophic Backtracking](Catastrophic backtracking

regex possessive quantifier against lazy or greedy

On greedy variant

On reluctant variant

On negated character class and possessive quantifiers combo

regular-expressions.info

Related questions

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

On greedy variant

On reluctant variant

On negated character class and possessive quantifiers combo

regular-expressions.info

Related questions

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？