Is it better to use a non-greedy qualifier or a lookahead?

Posted 2023-01-02 02:52

I have a possibly large block of text to search for instances of [[...]], where the ... can be anything, including other brackets (though they cannot be nested; the first instance of ]] after [[ ends the match).

I can think of two ways to match this text:

Using a non-greedy qualifier: /\[\[.+?\]\]/
Using a lookahead: /\[\[(?:(?!\]\]).)+\]\]/

Is one choice inherently better than the other, from a performance standpoint (I'd say the first is probably more readable)? I recall reading that it's better not to use non-greedy qualifiers, but I cannot find a source for that now.

It is better to use a non-greedy quantifier in this case.

Take this example string "[[a]b]]"

Non-greedy quantifier

       \[\[.+?\]\]
Atom # 1 2 3  4 5

Atom #1 \[ matches
Atom #2 \[ matches
Atom #3 .+? matches the "a"
Atom #4 \] matches
Atom #5 \] fails, back to #3 but keep string position
Atom #3 .+? matches the "]"
Atom #4 \] fails, back to #3 but keep string position
Atom #3 .+? matches the "b"
Atom #4 \] matches
Atom #5 \] matches
success

Look-ahead:

       \[\[(?:(?!\]\]).)+\]\]
Atom # 1 2 3  4       5  6 7

Atom #1 \[ matches
Atom #2 \[ matches
Atom #4 (?!\]\]) succeeds (i.e. non-match) immediately at "a", go on
Atom #5 . matches the "a", repeat at #3
Atom #4 (?!\]\]) achieves partial match at "]"
Atom #4 (?!\]\]) succeeds (i.e. non-match) at "b", go on
Atom #5 . matches the "]", repeat at #3
Atom #4 (?!\]\]) succeeds (i.e. non-match) immediately at "b", go on
Atom #5 . matches the "b", repeat at #3
Atom #4 (?!\]\]) achieves partial match at "]"
Atom #4 (?!\]\]) achieves full match at "]", ergo: #4 fails, exit #3
Atom #6 \] matches
Atom #7 \] matches
success

So it looks like the non-greedy quantifier has less work to do.

Disclaimer: This is an artificial example, and real-life performance may differ depending on the input, the actual expression, and the implementation of the regex engine. I'm only 98% sure that what I outlined here is what is actually happening, so I'm open to corrections. Also, as with all performance tips, don't take this at face value; run your own benchmark comparisons if you want to know for sure.
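For anyone who wants to run that comparison, here is a minimal benchmark sketch in Python. The sample text, the repeat counts, and the helper names find_all and time_pattern are assumptions made for illustration; timings will vary with the engine, the input, and the machine, so treat the numbers as a rough indication only.

```python
import re
import timeit

# The two patterns from the question.
NON_GREEDY = re.compile(r"\[\[.+?\]\]")
LOOKAHEAD = re.compile(r"\[\[(?:(?!\]\]).)+\]\]")

# Hypothetical sample input; replace it with text that resembles your real data.
SAMPLE = "some text [[a]b]] more text [[link|label]] tail " * 200


def find_all(pattern, text=SAMPLE):
    """Return every [[...]] match of the compiled pattern in text."""
    return pattern.findall(text)


def time_pattern(pattern, repeat=3, number=5):
    """Best-of-`repeat` time in seconds for `number` full scans of SAMPLE."""
    return min(timeit.repeat(lambda: find_all(pattern), repeat=repeat, number=number))


if __name__ == "__main__":
    # Both patterns should agree on what they match before timing them.
    assert find_all(NON_GREEDY) == find_all(LOOKAHEAD)
    for name, pattern in (("non-greedy", NON_GREEDY), ("lookahead", LOOKAHEAD)):
        print(f"{name}: {time_pattern(pattern):.4f} s")
```

Checking that both patterns return the same matches first keeps the timing comparison honest: a faster pattern that matches something different is not a fair substitute.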

Another variant: /\[\[((?:\]?[^]])+)]]/

It uses neither non-greedy quantifiers nor look-aheads. It allows a single ] before any non-] character. If there are two ] in sequence, the inner repetition stops and the match ends.

This pattern would be best to use with FSA-compiling regex engines. On back-tracking engines, it could get slower than the non-greedy variant.
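A quick way to check how this variant behaves is to try it in Python's re module (a backtracking engine, so this illustrates only the matching, not the FSA performance claim). The helper name inner_texts is an assumption for illustration, and the character class is written as [^\]] here, which is equivalent to the [^]] in the pattern above.

```python
import re

# Same pattern as above, with the closing brackets escaped for readability.
VARIANT = re.compile(r"\[\[((?:\]?[^\]])+)\]\]")


def inner_texts(text):
    """Return the text captured between [[ and ]] for every match."""
    return VARIANT.findall(text)


# A single ] inside the brackets is allowed; the first ]] ends the match.
print(inner_texts("[[a]b]] and [[c]]"))  # ['a]b', 'c']
```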

Which regex flavor are you using? If it's one that supports possessive quantifiers, there's a much better alternative:

\[\[(?:[^\]]++|\](?!\]))*+\]\]

[^\]]++ gobbles up any characters other than ] and doesn't bother saving the state information that would make backtracking possible. If it does see a ], it performs a lookahead to see if there's another. Wrapping the whole thing in another possessive quantifier means it only does a lookahead whenever it sees a ], and it only backtracks once: when it finds the closing ]].

Possessive quantifiers are supported by the Java, JGSoft, PCRE (PHP), Oniguruma (Ruby 1.9), and Perl 5.12 flavors. All those flavors also support atomic groups, which can be used to achieve the same effect:

\[\[(?>(?:(?>[^\]]+)|\](?!\]))*)\]\]

The .NET flavor supports atomic groups but not possessive quantifiers.
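As a side note beyond the flavors listed above, Python's built-in re module accepts both possessive quantifiers and atomic groups from version 3.11 onward; older versions reject them with a compile error. The sketch below tries both patterns and falls back gracefully on older interpreters. The helper names compile_if_supported and find_links are assumptions for illustration.

```python
import re

# The two patterns from this answer, unchanged.
POSSESSIVE = r"\[\[(?:[^\]]++|\](?!\]))*+\]\]"
ATOMIC = r"\[\[(?>(?:(?>[^\]]+)|\](?!\]))*)\]\]"


def compile_if_supported(pattern):
    """Compile the pattern, or return None if this Python's re rejects the syntax."""
    try:
        return re.compile(pattern)
    except re.error:
        # Python older than 3.11: no possessive quantifiers or atomic groups.
        return None


def find_links(pattern, text):
    """Return all matches, or None when the pattern syntax is unsupported."""
    compiled = compile_if_supported(pattern)
    if compiled is None:
        return None
    return compiled.findall(text)


if __name__ == "__main__":
    for name, pattern in (("possessive", POSSESSIVE), ("atomic", ATOMIC)):
        print(name, find_links(pattern, "x [[a]b]] y [[c]]"))
```

Note that these two patterns use * rather than +, so unlike the patterns in the question they would also match an empty [[]].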

I would think it is better to use the non-greedy qualifier. Are you sure that the article you read wasn't saying "be careful with greedy matching?"
