Why do ".*" and ".+" give different results?

2022-12-11 09:17 Q&A author:

System.out.println("foo".replaceAll(".+", "bar")); // --> "bar"
System.out.println("foo".replaceAll(".*", "bar")); // --> "barbar"

I would expect "bar" for both, since * and + are both greedy and should match the whole String. (The above example is Java, but other tools, like http://www.gskinner.com/RegExr/, give me the same result.)

You're right about both being greedy but ".*" is matching two strings: the first one is "foo" and the second is "". ".+" will only match "foo".

Both try to match the longest possible string which is "foo". After that, they try to find the longest matching string coming after the previous match. In this phase, ".*" is able to match an empty string while ".+" won't.
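To see the two matches directly, here is a minimal sketch that lists every match Matcher.find() reports for each pattern. The class name MatchLister and the helper matches are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchLister {
    // Returns one "start-end:'text'" entry per match that Matcher.find() reports.
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.start() + "-" + m.end() + ":'" + m.group() + "'");
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(matches(".+", "foo")); // [0-3:'foo']
        System.out.println(matches(".*", "foo")); // [0-3:'foo', 3-3:'']
    }
}
```

The second entry for ".*" is the empty match at index 3, the end of the string. replaceAll replaces it too, which is where the second "bar" comes from.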

Mehrdad already explained that it also matches one empty substring at the end of the string. I found an official explanation of this behavior (why match one empty substring instead of an infinite number) in the .net documentation:

http://msdn.microsoft.com/en-us/library/c878ftxe.aspx

Quantifiers *, +, {n,m} (and their "lazy" counterparts) never repeat after an empty match when the minimum number n has been matched. This rule prevents quantifiers from entering infinite loops on empty matches when m is infinite (although the rule applies even if m is not infinite).

For example, (a?)* matches the string "aaa" and captures substrings in the pattern (a)(a)(a)(). Note that there is no fifth empty capture, because the fourth empty capture causes the quantifier to stop repeating.

Tested by experiment: replaceAll's matcher won't match twice in the same string position without advancing.

Experiment:

System.out.println("foo".replaceAll(".??", "[bar]"));

Output:

[bar]f[bar]o[bar]o[bar]

Explanation:

The pattern .?? is a non-greedy match of 0 or 1 characters, which means it will match nothing by preference, and one character if forced to. On the first iteration, it matches nothing, and the replaceAll replaces "" with "[bar]" in the beginning of the string. On the second iteration, it would match nothing again, but that's prohibited, so instead one character is copied from the input to the output ("f"), the position is advanced, the match is tried again, etc. so you have bar - f - bar - o - bar - o - bar: one "[bar]" for every distinct place where an empty string can be matched. At the end there's no possibility to advance so the replacement terminates, but only after matching the "final" empty string.
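The copy-and-advance behavior described above can be reproduced with the find / appendReplacement / appendTail loop that the Matcher documentation describes. This is only a sketch of that loop, not necessarily how any particular JDK implements replaceAll; the names ReplaceAllTrace and replaceAllByHand are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceAllTrace {
    // Hand-rolled replace-all built from find / appendReplacement / appendTail.
    static String replaceAllByHand(String regex, String input, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Copies the unmatched input since the previous match, then the replacement.
            m.appendReplacement(out, replacement);
        }
        // Copies whatever input is left after the last match.
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceAllByHand(".??", "foo", "[bar]")); // [bar]f[bar]o[bar]o[bar]
    }
}
```

Each of the four matches here is empty (at indices 0, 1, 2 and 3), so the letters in the output are the unmatched text copied between them.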

Just for curiosity's sake, Perl does something very similar, but it applies the rule differently, giving an output of "[bar][bar][bar][bar][bar][bar][bar]" for the same input and the same pattern -- .?? is still prohibited from making a zero-width match twice in a row in the same position, but it's allowed to backtrack and match a single character. Meaning it replaces "" with "[bar]", then replaces "f" with "[bar]", then "" with "[bar]" then "o" with "[bar]", etc. until at the end of the string the zero-width match is prohibited and there's no further positive-width match possible.

My guess is that the greedy .* first matches the whole string and then starts looking for a match from the current position (end of string) and matches the empty string before quitting.

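Hm, Python produces 'bar' in both cases. (This output is from a Python version before 3.7; from 3.7 on, re.sub also replaces an empty match adjacent to a previous non-empty match, so the second call returns 'barbar'.)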

>>> import re
>>> re.sub('.+', 'bar', 'foo')
'bar'
>>> re.sub('.*', 'bar', 'foo')
'bar'

That's a really interesting question.

When you think about it, String.replaceAll(...) could logically have been implemented to do one of three things in the ".*" case:

do one replacement, giving "bar"
do two replacements giving "barbar"
try to do an infinite number of replacements.

Clearly, the last alternative is not useful, so I can understand why they didn't do that. But we don't know why they chose the "barbar" interpretation instead of the "bar" interpretation. The problem is that there is no universal standard for regex syntax, let alone regex semantics. My guess is that the Sun author(s) did one of the following:

look at what other pre-existing implementations did and copied,
thought about it and did what they thought was best, or
didn't consider this edge case, and the current behavior is unintentional.

But at the end of the day, it doesn't really matter WHY they chose "barbar". The fact is that they did ... and we just need to deal with this.
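One way to deal with it, if a single replacement is what you want, is to stop after the first match with replaceFirst. A minimal sketch follows; the class SingleReplacement and the method replaceFirstLine are hypothetical names for this example. Since . does not match line terminators by default, only the first line of a multi-line input would be replaced.

```java
import java.util.regex.Matcher;

class SingleReplacement {
    // Replaces only the first match of ".*", so the trailing empty match is never reached.
    static String replaceFirstLine(String input, String replacement) {
        return input.replaceFirst(".*", Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceFirstLine("foo", "bar"));        // bar
        System.out.println("foo".replaceAll(".*", "bar"));         // barbar
        System.out.println("".replaceAll(".+", "bar").isEmpty());  // true
    }
}
```

Switching to ".+" also gives a single "bar" for "foo", but it leaves an empty input untouched, whereas the replaceFirst version turns an empty input into "bar".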

I think that in the first round both patterns (.+ and .*) match the whole string ("foo"). After that, the remaining input, which is the empty string, is matched by the .* pattern.

However, I found a quite strange result from the following patterns.

^.*  => 'bar'
.*$  => 'barbar'
^.*$ => 'bar'

Can you explain why it returns the results above? What is the difference between the start-of-string anchor (^) and the end-of-string anchor ($) in regular expressions?
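A likely explanation, sketched here in Java: by default ^ can only match at index 0, so once "foo" has been consumed, the empty match at index 3 is rejected by any pattern that starts with ^. The $ anchor is satisfied at index 3, so ".*$" still finds the empty match there. The class AnchorSpans and the helper spans are names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorSpans {
    // Lists the "start-end" index pair of every match found in the input.
    static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start() + "-" + m.end());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(spans("^.*", "foo"));  // [0-3]
        System.out.println(spans(".*$", "foo"));  // [0-3, 3-3]
        System.out.println(spans("^.*$", "foo")); // [0-3]
    }
}
```

Only ".*$" reports the extra 3-3 span, which matches the 'barbar' result for that pattern.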

Update 1

I tried changing the input string to the following:

foo

foo

Look at the new result:

'^.*' =>

bar

foo

'.*$' =>

foo

barbar

So I think there is only one start-of-string position per input. On the other hand, when the function looks for matches in the input, the current match does not consume the end-of-string position, so the end can be matched again. PS: You can quickly try it at http://gskinner.com/RegExr/
