In C# regular expressions, why does the initial match show up in the groups?

So if I write a regex and run a match, I can get the match's value or I can access its groups. This seems counterintuitive, since the groups are defined in the expression with parentheses "(" and ")". It seems not only wrong but redundant. Does anyone know why?

Regex quickCheck = new Regex(@"(\D+)\d+");
string source = "abc123";
Match m = quickCheck.Match(source);

m.Value            // Equals source
m.Groups.Count     // Equals 2
m.Groups[0].Value  // Equals source
m.Groups[1].Value  // Equals "abc"


I agree - it is a little strange, however I think there are good reasons for it.

A Regex Match is itself a Group, which in turn is a Capture.

But Match.Value (or Capture.Value, as it actually is) only ever describes a single match. If the pattern matches several times in the string, then by definition one Value can't return everything; each Match (from Regex.Matches or Match.NextMatch) carries its own. In effect, the Value property on the Match is a convenience for when there is only one match.
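To make the multiple-match case concrete, here is a minimal sketch that reuses the question's pattern on a longer input. The class and method names (MultipleMatchesDemo, Describe) and the input "abc123def456" are made up for illustration; only Regex.Matches, Match.Value and Match.Groups come from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MultipleMatchesDemo
{
    // Hypothetical helper: returns "Value|Groups[0]|Groups[1]" for every
    // match of the question's pattern (\D+)\d+ in the input.
    public static string[] Describe(string input)
    {
        MatchCollection matches = Regex.Matches(input, @"(\D+)\d+");
        var lines = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            Match m = matches[i];
            // m.Value and m.Groups[0].Value are the same text: the whole match.
            lines[i] = m.Value + "|" + m.Groups[0].Value + "|" + m.Groups[1].Value;
        }
        return lines;
    }

    public static void Main()
    {
        // Prints:
        // abc123|abc123|abc
        // def456|def456|def
        foreach (string line in Describe("abc123def456"))
            Console.WriteLine(line);
    }
}
```

Each Match in the collection has its own Groups[0], which always mirrors that match's Value.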

But to clarify where this behaviour of passing the whole match into Groups[0] makes sense - consider this (contrived) example of a naive code unminifier:

[TestMethod]
public void UnMinifyExample()
{
  string toUnMinify = "{int somevalue = 0; /*init the value*/} /* end */";
  string result = Regex.Replace(toUnMinify, @"(;|})\s*(/\*[^*]*?\*/)?\s*", "$0\n");
  Assert.AreEqual("{int somevalue = 0; /*init the value*/\n} /* end */\n", result);
}

The regex match will preserve /* */ comments at the end of a statement, placing a newline afterwards - but works for either ; or } line-endings.

Okay - you might wonder why you'd bother doing this with a regex - but humour me :)

If Groups[0] for this regex's matches was not the whole match, then a single-call replace would not be possible, and your question would probably be asking why the whole match doesn't get put into Groups[0], instead of the other way round!


The documentation for Match says that the first group is always the entire match, so it's not an implementation detail.


It's historical is all. In Perl 5, the contents of capture groups are stored in the special variables $1, $2, etc., but C#, Java, and others instead store them in an array (or array-like structure). To preserve compatibility with Perl's naming convention (which has been copied by several other languages), the first group is stored in element number one, the second in element two, etc. That leaves element zero free, so why not store the full match there?

FYI, Perl 6 has adopted a new convention, in which the first capturing group is numbered zero instead of one. I'm sure it wasn't done just to piss us off. ;)


Most likely so that you can use "$0" to represent the match in a substitution expression, and "$1" for the first group match, etc.
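As a small sketch of that, using the question's pattern: the helper names below (SubstitutionDemo, WrapWholeMatch, WrapFirstGroup) and the bracket decoration are illustrative only; the "$0" and "$1" substitution syntax is what Regex.Replace itself understands.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubstitutionDemo
{
    // "$0" in the replacement string stands for the whole match (Groups[0]).
    public static string WrapWholeMatch(string input)
    {
        return Regex.Replace(input, @"(\D+)\d+", "[$0]");
    }

    // "$1" stands for the first parenthesized group (Groups[1]).
    public static string WrapFirstGroup(string input)
    {
        return Regex.Replace(input, @"(\D+)\d+", "[$1]");
    }

    public static void Main()
    {
        Console.WriteLine(WrapWholeMatch("abc123")); // [abc123]
        Console.WriteLine(WrapFirstGroup("abc123")); // [abc]
    }
}
```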


I don't think there's really an answer other than the person who wrote this chose that as an implementation detail. As long as you remember that the first group will always equal the whole matched text (which in your example happens to be the entire source string) you should be ok :-)


Not sure why either. If you use named groups you can set the option RegexOptions.ExplicitCapture so that unnamed parentheses stop capturing, but note that Groups[0] still holds the whole match even then.
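It is easy to check what ExplicitCapture actually changes. The sketch below is only an illustration: the pattern, the group name "letters" and the class name ExplicitCaptureDemo are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExplicitCaptureDemo
{
    // One named group and one unnamed pair of parentheses.
    private const string Pattern = @"(?<letters>\D+)(\d+)";

    // Number of groups reported for a match against "abc123".
    public static int GroupCount(RegexOptions options)
    {
        return new Regex(Pattern, options).Match("abc123").Groups.Count;
    }

    // Groups[0] for the same match.
    public static string WholeMatch(RegexOptions options)
    {
        return new Regex(Pattern, options).Match("abc123").Groups[0].Value;
    }

    public static void Main()
    {
        // Default: group 0, the unnamed (\d+) and "letters".
        Console.WriteLine(GroupCount(RegexOptions.None));            // 3
        // ExplicitCapture: the unnamed (\d+) no longer captures.
        Console.WriteLine(GroupCount(RegexOptions.ExplicitCapture)); // 2
        // Group 0 is the whole match either way.
        Console.WriteLine(WholeMatch(RegexOptions.ExplicitCapture)); // abc123
    }
}
```

So the option trims unnamed groups, while Groups[0] stays the full match.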


It might be redundant, however it has some nice properties.

For example, it means the capture groups work the same way as other regex engines - the first capture group corresponds to "1", and so on.


Backreferences are one-based, e.g., \1 or $1 is the first parenthesized subexpression, and so on. As laid out, one maps to the other without any thought.

Also of note: m.Groups["0"] gives you the entire matched substring, so be sure to skip "0" if you're iterating over regex.GetGroupNames().
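A minimal sketch of that iteration, assuming a pattern with two named groups; the names "letters" and "digits" and the helper NamedCaptures are hypothetical and chosen just for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupNamesDemo
{
    // Collects every group's value by name, skipping "0" (the whole match).
    public static Dictionary<string, string> NamedCaptures(string input)
    {
        Regex regex = new Regex(@"(?<letters>\D+)(?<digits>\d+)");
        Match m = regex.Match(input);
        var result = new Dictionary<string, string>();
        if (!m.Success)
            return result;

        foreach (string name in regex.GetGroupNames())
        {
            if (name == "0")
                continue; // group "0" is the entire matched substring
            result[name] = m.Groups[name].Value;
        }
        return result;
    }

    public static void Main()
    {
        // Prints:
        // letters = abc
        // digits = 123
        foreach (KeyValuePair<string, string> pair in NamedCaptures("abc123"))
            Console.WriteLine(pair.Key + " = " + pair.Value);
    }
}
```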
