Why is the rightmost character captured in backreference when using a character class with quantifiers?
If I have pattern ([a-z]){2,4} and string "ab", what would I expect to see in backreference \1?
I'm getting "b", but why "b" rather than "a"?
I'm sure there is a valid explanation, but reading around various sites explaining regexes, I haven't found one. Anybody?
I'm not sure why nobody put this as an answer, but just for anyone hitting this page with a similar question, the answer is essentially that this regex:
([a-z]){2,4}
will match a single character between a and z at least 2 and as many as 4 times. It will match each character separately, overwriting anything previously matched and stored into the backreference (that is, whatever is between the () characters in the expression).
A similar expression (suggested in the comments on the question):
([a-z]{2,4})
moves the back-reference to surround the entire match (2-4 characters a-z) instead of a single character.
The parentheses represent a capture into a back-reference. When the repetition is inside the capture (the second example), it will capture all characters that make up that repetition. When the repetition is outside the capture (the first example), it will capture one letter, then repeat the process, capturing the next letter into the same back-reference, thus overwriting it. In this case, it will then repeat that process up to 2 more times, overwriting it each time.
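To make the difference concrete, here is a minimal sketch using Python's re module. The question does not say which language is in use, so Python is an assumption here, and the names repeat_outside, repeat_inside, and first_group are made up for illustration.

```python
import re

# Repetition OUTSIDE the group: the group is re-captured on every iteration.
repeat_outside = re.compile(r"([a-z]){2,4}")
# Repetition INSIDE the group: the group captures the whole run at once.
repeat_inside = re.compile(r"([a-z]{2,4})")


def first_group(pattern, text):
    """Return the contents of group 1, or None if the pattern does not match."""
    m = pattern.match(text)
    return m.group(1) if m else None


print(first_group(repeat_outside, "ab"))  # b  (the "a" capture was overwritten)
print(first_group(repeat_inside, "ab"))   # ab
```

Other engines generally behave the same way for this pattern, but it is worth checking against the one you actually use.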
So, matching against the target "abc" will result in \1 equaling c. Matching against the target "abcd" will result in \1 equaling d. With more letters, and depending upon the function (and language) used to run the regular expression, the target "abcde" might fail to match, or might result in the back-reference \1 equaling d (because the e is not part of the match).
The first example expression can be used to get abc or abcd if you use the whole-match back-reference (often $& or $0, but also \& or \0, and in Tcl just an & character); this returns the entire string matched by the entire regular expression.
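In Python (an assumption, since the question names no language), the whole match is available as group 0 on the match object and as \g<0> in a replacement string. The helper name whole_and_group below is hypothetical.

```python
import re

pattern = re.compile(r"([a-z]){2,4}")


def whole_and_group(text):
    """Return (whole match, group 1), or None if nothing matches."""
    m = pattern.search(text)
    if m is None:
        return None
    return m.group(0), m.group(1)


print(whole_and_group("abc"))    # ('abc', 'c')
print(whole_and_group("abcde"))  # ('abcd', 'd')

# In a substitution, \g<0> stands for the entire matched text.
bracketed = pattern.sub(r"[\g<0>]", "abcde")
print(bracketed)                 # [abcd]e
```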