Why are regular expressions greedy by default?

It seems that this is a huge source of confusion for beginners writing regular expressions, can cause hidden performance problems, and it would seem that a typical use case would be non-greedy.

Is this just for legacy reasons (it was how it was first done, and every implementation copies that), or is there a reason for it?


Hysterical Raisins


Part of the answer may involve the origins of REs in practical computing. They were originally a theoretical concept from automata theory and formal language theory until Ken Thompson himself wrote a real implementation and used them in qed and ed(1).

The original version had only the greedy syntax, so there wasn't even a decision to make.


As for performance, lazy quantifiers aren't always faster, because of backtracking: http://blog.stevenlevithan.com/archives/greedy-lazy-performance
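For instance, here is a minimal Python sketch (the input string is made up, and timings vary by engine and input) where the lazy form does more work than the greedy one, not less:

import re
import timeit

text = "<" + "a" * 10000 + ">"

greedy = re.compile(r"<.+>")   # runs to the end of the string, then backs up to the final '>'
lazy = re.compile(r"<.+?>")    # grows its match one character at a time until it sees '>'

print(timeit.timeit(lambda: greedy.search(text), number=1000))
print(timeit.timeit(lambda: lazy.search(text), number=1000))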

As for the actual design, I honestly can't say why quantifiers are greedy by default but I do wonder what control character would have been used to make a quantifier greedy instead of lazy. I don't think ? would have cut it :-)


Possible reason: The regex engine needs to backtrack a lot if it's non-greedy.


Well, it is important that computers behave predictably whenever possible. The default behavior should therefore follow a simple rule, like greedy matching, so that at least experienced programmers can predict the outcome of a piece of code.

As for whether a typical use case should be non-greedy, what about the following: suppose I have a file with entries like foo1909, bar3939, baz3331, and I just want to extract these numbers. It seems natural enough to write (\d*) as the regular expression for this.

You might say that it is just as easy to write (\d*)\D or whatever, but the programmer can essentially always be more explicit and less ambiguous. Since we wanted a default behavior that is 100% predictable and trivial to work out in one's head, greedy matching seems reasonable to me.
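To make that concrete, here is a minimal Python sketch (using \d+ rather than \d* so the search doesn't settle for an empty match in front of the digits):

import re

for entry in ["foo1909", "bar3939", "baz3331"]:
    greedy = re.search(r"(\d+)", entry).group(1)   # takes all the digits: '1909'
    lazy = re.search(r"(\d+?)", entry).group(1)    # stops after one digit: '1'
    print(greedy, lazy)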


The real issue here is the Kleene closure operator (star) and the quantifiers derived from it; for everything else in a regular expression, the longest match is the same as the shortest match.

When you think about it in those terms, you see that more modern tools recognize that you need both. I'm up late, so I can think of only two examples (a short sketch follows the list):

  • Both ksh and bash provide "longest match" and "shortest match" forms of most of the special variable-altering operators (e.g. ${var##pattern} vs. ${var#pattern}).

  • The Lua regular expressions include * for Kleene closure longest match and - for Kleene closure shortest match. This one always bites me when I forget to escape a literal - sign.
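Here is a minimal Python sketch of why you want both forms (Python spells shortest match *?; the sample string is made up):

import re

s = "<b>bold</b> and <i>italic</i>"

print(re.findall(r"<.*>", s))   # longest match: grabs the whole string, tags and all
print(re.findall(r"<.*?>", s))  # shortest match: ['<b>', '</b>', '<i>', '</i>']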

It would be interesting to go back to Kleene's original work and see if that might have influenced early tools toward longest match.


"it would seem that a typical use case would be non-greedy."

I want to make it clear that this is wrong, unless "typical use case" means HTML-hacking.

An easy example is the lexical analyser for a programming language. You simply don't want

foo = 42

to be interpreted as three identifiers (f, o, o), followed by an equals sign, followed by two numbers (4 and 2). On the contrary, you typically expect your lexer to take the longest possible match.
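Here is a minimal tokenizer sketch in Python (the token names and the grammar are made up for illustration); the greedy quantifiers are exactly what gives each token the longest possible match:

import re

# Each alternative uses a greedy quantifier, so every token is as long as possible.
TOKEN = re.compile(r"\s*(?:(?P<IDENT>[A-Za-z_]\w*)|(?P<NUMBER>\d+)|(?P<EQ>=))")

def tokenize(source):
    pos = 0
    while pos < len(source):
        m = TOKEN.match(source, pos)
        if not m:
            raise SyntaxError("unexpected character at position %d" % pos)
        pos = m.end()
        yield m.lastgroup, m.group(m.lastgroup)

print(list(tokenize("foo = 42")))
# [('IDENT', 'foo'), ('EQ', '='), ('NUMBER', '42')]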

Before the advent of HTML, we elders lived for decades with greedy regular expressions, and we did just fine. Even today I don't use non-greedy ones in 99% of all cases, admittedly because I am too lazy to look up the syntax, but also because the occasions where you couldn't just write a well-terminated greedy one are rare. For example, to match a string:

"(\\"|[^"])*"