Why do ".*" and ".+" give different results?

System.out.println("foo".replaceAll(".+", "bar")); // --> "bar"
System.out.println("foo".replaceAll(".*", "bar")); // --> "barbar"

I would expect "bar" for both, since * and + are both greedy and should match the whole string. (The above example is Java, but other tools, like http://www.gskinner.com/RegExr/, give me the same result.)


You're right about both being greedy but ".*" is matching two strings: the first one is "foo" and the second is "". ".+" will only match "foo".

Both try to match the longest possible string which is "foo". After that, they try to find the longest matching string coming after the previous match. In this phase, ".*" is able to match an empty string while ".+" won't.
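To see the two matches directly, here is a minimal sketch that lists everything Matcher.find() reports for each pattern. The class name MatchTrace and the helper trace are made up for this illustration; only the java.util.regex calls are standard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchTrace {
    // Hypothetical helper: records each match as "[start,end)='text'".
    public static List<String> trace(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add("[" + m.start() + "," + m.end() + ")='" + m.group() + "'");
        }
        return found;
    }

    public static void main(String[] args) {
        // ".+" needs at least one character, so nothing matches at the end of the input.
        System.out.println(trace(".+", "foo")); // [[0,3)='foo']
        // ".*" also matches the empty string left at index 3.
        System.out.println(trace(".*", "foo")); // [[0,3)='foo', [3,3)='']
    }
}
```

Each reported match is replaced once by replaceAll, which is where the second "bar" comes from.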


Mehrdad already explained that ".*" also matches one empty substring at the end of the string. I found an official explanation of this behavior (why it matches one empty substring instead of an infinite number) in the .NET documentation:

http://msdn.microsoft.com/en-us/library/c878ftxe.aspx

Quantifiers *, +, {n,m} (and their "lazy" counterparts) never repeat after an empty match when the minimum number n has been matched. This rule prevents quantifiers from entering infinite loops on empty matches when m is infinite (although the rule applies even if m is not infinite).

For example, (a?)* matches the string "aaa" and captures substrings in the pattern (a)(a)(a)(). Note that there is no fifth empty capture, because the fourth empty capture causes the quantifier to stop repeating.


Tested by experiment: replaceAll's matcher won't match twice in the same string position without advancing.

Experiment:

System.out.println("foo".replaceAll(".??", "[bar]"));

Output:

[bar]f[bar]o[bar]o[bar]

Explanation:

The pattern .?? is a non-greedy match of 0 or 1 characters, which means it will match nothing by preference, and one character if forced to. On the first iteration, it matches nothing, and the replaceAll replaces "" with "[bar]" in the beginning of the string. On the second iteration, it would match nothing again, but that's prohibited, so instead one character is copied from the input to the output ("f"), the position is advanced, the match is tried again, etc. so you have bar - f - bar - o - bar - o - bar: one "[bar]" for every distinct place where an empty string can be matched. At the end there's no possibility to advance so the replacement terminates, but only after matching the "final" empty string.
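The copy-and-advance behavior described above can be reproduced with the find / appendReplacement / appendTail loop from java.util.regex.Matcher. This is only a sketch of how I understand replaceAll to behave, not the actual library source; the class name ReplaceAllSketch and the helper replaceAllByHand are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceAllSketch {
    // Hypothetical helper: a hand-written loop that should give the same
    // result as String.replaceAll for these inputs.
    public static String replaceAllByHand(String input, String regex, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Copies the input skipped since the previous match, then the replacement.
            m.appendReplacement(out, replacement);
        }
        // Copies whatever input is left after the last match.
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceAllByHand("foo", ".??", "[bar]")); // [bar]f[bar]o[bar]o[bar]
        System.out.println(replaceAllByHand("foo", ".*", "bar"));    // barbar
    }
}
```

After an empty match, the next find() starts one position further on, so the skipped character ("f", then "o", then "o") is copied through unchanged by appendReplacement.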

Just for curiosity's sake, Perl does something very similar, but it applies the rule differently, giving an output of "[bar][bar][bar][bar][bar][bar][bar]" for the same input and the same pattern -- .?? is still prohibited from making a zero-width match twice in a row in the same position, but it's allowed to backtrack and match a single character. Meaning it replaces "" with "[bar]", then replaces "f" with "[bar]", then "" with "[bar]" then "o" with "[bar]", etc. until at the end of the string the zero-width match is prohibited and there's no further positive-width match possible.


My guess is that the greedy .* first matches the whole string and then starts looking for a match from the current position (end of string) and matches the empty string before quitting.


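Hm, Python produces 'bar' in both cases (this holds for versions before 3.7; from Python 3.7 on, re.sub also replaces an empty match adjacent to a previous non-empty match, so the second call returns 'barbar'):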

>>> import re
>>> re.sub('.+', 'bar', 'foo')
'bar'
>>> re.sub('.*', 'bar', 'foo')
'bar'


That's a really interesting question.

When you think about it, String.replaceAll(...) could logically have been implemented to do one of three things in the ".*" case:

  • do one replacement, giving "bar"
  • do two replacements giving "barbar"
  • try to do an infinite number of replacements.

Clearly, the last alternative is not useful, so I can understand why they didn't do that. But we don't know why they chose the "barbar" interpretation instead of the "bar" interpretation. The problem is that there is no universal standard for regex syntax, let alone regex semantics. My guess is that the Sun author(s) did one of the following:

  • look at what other pre-existing implementations did and copied,
  • thought about it and did what they thought was best, or
  • didn't consider this edge case, and the current behavior is unintentional.

But at the end of the day, it doesn't really matter WHY they chose "barbar". The fact is that they did ... and we just need to deal with this.


I think that in the first round both patterns (.+ and .*) match the whole string ("foo"). After that, the remaining input, which is the empty string, is matched by the .* pattern.

However, I found a rather strange result with the following patterns.

^.*  => 'bar'
.*$  => 'barbar'
^.*$ => 'bar'

Can you explain why it returns the above results? What is the difference between the start-of-string anchor (^) and the end-of-string anchor ($) in regular expressions?
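For reference, the three results above can be reproduced in Java with the snippet below. The comments give one possible explanation, assuming the default flags (no MULTILINE); the class name AnchorDemo is made up for the example.

```java
public class AnchorDemo {
    public static void main(String[] args) {
        // ^ can only match at index 0, so the empty match at index 3 is rejected.
        System.out.println("foo".replaceAll("^.*", "bar"));  // bar
        // $ still matches at index 3, so the trailing empty match is replaced too.
        System.out.println("foo".replaceAll(".*$", "bar"));  // barbar
        // With both anchors, the ^ again rules out the second, empty match.
        System.out.println("foo".replaceAll("^.*$", "bar")); // bar
    }
}
```

In other words, the start anchor is "used up" by the first match, while the end-of-input position is still there to be matched after "foo" has been consumed.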

Update.1

I tried changing the input string to the following:

foo

foo

Please look at the new results!

'^.*' =>

bar

foo

'.*$' =>

foo

barbar

So I think there is only one start-of-string position for each input. On the other hand, when the function searches the input for matches, the end-of-string position is not consumed by the current match, so it can still be matched afterwards. PS. You can quickly try it at http://gskinner.com/RegExr/
