Regex lookahead ordering
I'm pretty decent with regular expressions, and now I'm trying once again to understand lookahead and lookbehind assertions. They mostly make sense, but I'm not quite sure how the order affects the result. I've been looking at this site which places lookbehinds before the expression, and lookaheads after the expression. My question is, does this change anything? A recent an开发者_运维百科swer here on SO placed the lookahead before the expression which is leading to my confusion.
When tutorials introduce lookarounds, they tend to choose the simplest use case for each one. So they'll use examples like (?<!a)b
('b' not preceded by 'a') or q(?=u)
('q' followed by 'u'). It's just to avoid cluttering the explanation with distracting details, but it tends to create (or reinforce) the impression that lookbehinds and lookaheads are supposed to appear in a certain order. It took me quite a while to get over that idea, and I've seen several others afflicted with it, too.
Try looking at some more realistic examples. One question that comes up a lot involves validating passwords; for example, making sure a new password is at least six characters long and contains at least one letter and one digit. One way to do that would be:
^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9]{6,}$
The character class [A-Za-z0-9]{6,}
could match all letters or all digits, so you use the lookaheads to ensure that there's at least one of each. In this case, you have to do the lookaheads first, because the later parts of the regex have to be able to examine the whole string.
For another example, suppose you need to find all occurrences of the word "there" unless it's preceded by a quotation mark. The obvious regex for that is (?<!")[Tt]here\b
, but if you're searching a large corpus, that could create a performance problem. As written, that regex will do the negative lookbehind at each and every position in the text, and only when that succeeds will it check the rest of the regex.
Every regex engine has its own strengths and weaknesses, but one thing that's true of all of them is that they're quicker to find fixed sequences of literal characters than anything else--the longer the sequence, the better. That means it can be dramatically faster to do the lookbehind last, even though it means matching the word twice:
[Tt]here\b(?<!"[Tt]here)
So the rule governing the placement of lookarounds is that there is no rule; you put them wherever they make the most sense in each case.
It's easier to show in an example than explain, I think. Let's take this regex:
(?<=\d)(?=(.)\1)(?!p)\w(?<!q)
What this means is:
(?<=\d)
- make sure what comes before the match position is a digit.(?=(.)\1)
- make sure whatever character we match at this (same) position is followed by a copy of itself (through the backreference).(?!p)
- make sure what follows is not ap
.\w
- match a letter, digit or underscore. Note that this is the first time we actually match and consume the character.(?<!q)
- make sure what we matched so far doesn't end with aq
.
All this will match strings like abc5ddx
or 9xx
but not 5d
or 6qq
or asd6pp
or add
. Note that each assertion works independently. It just stops, looks around, and if all is well, allows the matching to continue.
Note also that in most (probably all) implementations, lookbehinds have the limitation of being fixed-length. You can't use repetition/optionality operators like ?
, *
, and +
in them. This is because to match a pattern we need a starting point - otherwise we'd have to try matching each lookbehind from every point in the string.
A sample run of this regex on the string a3b5ddx
is as follows:
- Text cursor position: 0.
- Try to match the first lookbehind at position -1 (since
\d
always matches 1 character). We can't match at negative indices, so fail and advance the cursor.
- Try to match the first lookbehind at position -1 (since
- Text cursor position: 1.
- Try to match the first lookbehind at position 0.
a
does not match\d
so fail and advance the cursor again.
- Try to match the first lookbehind at position 0.
- Text cursor position: 2.
- Try to match the first lookbehind at position 1.
3
does match\d
so keep the cursor intact and continue matching. - Try to match the first lookahead at position 2.
b
matches(.)
and is captured.5
does not match\1
(which is the capturedb
). Therefore, fail and advance the cursor.
- Try to match the first lookbehind at position 1.
- Text cursor position: 3.
- Try to match the first lookbehind at position 2.
b
does not match\d
so fail and advance the cursor again.
- Try to match the first lookbehind at position 2.
- Text cursor position: 4.
- Try to match the first lookbehind at position 3.
5
does match\d
so keep the cursor intact and continue matching. - Try to match the first lookahead at position 4.
d
matches(.)
and is captured. The secondd
does match\1
(which is the first capturedd
). Allow the matching to continue from where we left off. - Try to match the second lookahead.
b
at position 4 does not matchp
, and since this is a negative lookahead, that's what we want; allow the matching to continue. - Try to match
\w
at position 4.b
matches. Advance cursor since we have consumed a character and continue. Also mark this as the start of the match.
- Try to match the first lookbehind at position 3.
- Text cursor position: 5.
- Try to match the second lookbehind at position 4 (since
q
always matches 1 character).d
does not matchq
which is what we want from a negative lookbehind. - Realize that we're at the end of the regex and report success by returning the substring from the start of the match to the current position (4 to 5), which is
d
.
- Try to match the second lookbehind at position 4 (since
1(?=ABC)
means - look for 1
, and match (but don't capture) ABC
after it.
(?<=ABC)1
means - match (but don't capture) ABC
before the current location, and continue to match 1
.
So, normally, you'll place the lookahead after the expression and the lookbehind before it.
When we place a lookbehind after the expression, we're rechecking the string we've already matched. This is common when you have complex conditions (you can think about it as the AND
of regexs). For example, take a look on this recent answer by Daniel Brückner:
.&.(?<! & )
First, you capture an ampersand between two characters. Next, you check they were both not spaces (\S&\S
would not work here, the OP wanted to capture 1&_
).
精彩评论