How can I capture all nonempty sequences of letters other than cat, dog, fish using a regular expression?
Please expl开发者_高级运维ain why the expression makes sense if it is complicated.
If you are actually using grep
, you could use the -v
option to select only the lines that don't match:
grep -v \(cat\|dog\|fish\|^$\)
The pattern will select empty lines and lines containing "cat", "dog" and "fish".
Okay, you're not using grep
. According to http://www.regular-expressions.info/refadv.html , if your regex engine supports it, you want ?!
:
`(?!regex)` Zero-width negative lookahead. Identical to positive lookahead, except that the overall match will only succeed if the regex inside the lookahead fails to match. `t(?!s)` matches the first `t` in `streets`.
Let's explore how we can build up a pattern which excludes specific phrases.
We'll start with a simple .*
, which matches any character (using the dot), zero or more times (star). This pattern will match any string, including an empty string1.
However, since there are specific phrases we don't want to match, we can try to use a negative lookaround to stop it from matching what we don't want. A lookaround is a zero-width assertion, which means that the regex engine needs to satisfy the assertion for there to be a match, but the assertion does not consume any characters (or in other words, it doesn't advance the position in the string). In this specific case, we will use a lookahead, which tells the regex engine to look ahead of the current position to match the assertion (there are also lookbehinds, which, naturally, look behind the current position). So we'll try (?!cat|dog|fish).*
.
When we try this pattern against catdogfish
, though, it matches atdogfish
! What's going on here? Let's take a look at what happens when the engine tries to use our pattern on catdogfish
.
The engine works from left to right, starting from before the first character in our string. On it's first attempt, the lookahead asserts that the next characters from that point are not cat
, dog
, or fish
, but since they actually are cat
, the engine cannot match from this point, and advances to before the second character. Here the assertion succeeds, because the next characters following do not satisfy the assertion (atf
does not match cat
or dog
and atfi
does not match fish
). Now that the assertion succeeds, the engine can match .*
, and since by default regular expressions are greedy (which means that they will capture as much of your string as possible), the dot-star will consume the rest of the string.
You might be wondering why the lookaround isn't checked again after the first assertion succeeds. That is because the dot-star is taken as one single token, with the lookaround working on it as a whole. Let's change that so that the lookaround asserts once per repetition: (?:(?!cat|dog|fish).)*
.
The (?:…)
is called a non-capturing group. In general, things in regular expressions are grouped by parentheses, but these parentheses are capturing, which means that the contents are saved into a backreference (or submatch). Since we don't need a submatch here, we can use a non-capturing group, which works the same as a normal group, but without the overhead of keeping track of a backreference.
When we run our new pattern against catdogfish
, we now get three matches2: at
, og
and ish
! Let's take a look at what's going on this time inside the regex engine.
Again the engine starts before the first character. It enters the group that will be repeated ((?!cat|dog|fish).
) and sees that the assertion fails, so moves onto the next position (a
). The assertion succeeds, and the engine moves forwards to t
. Again the assertion succeeds, and the engine moves forwards again. At this point, the assertion fails (because the next three characters are dog
), and the engine returns at
as a match, because that is the biggest string (so far, and the engine works from left to right), that matches the pattern.
Next, even though we've already got a match, the engine will continue. It will move forwards to the next character (o
), and again pick up two characters that match the pattern (og
). Finally, the same thing will happen for the ish
at the end of the string. Once the engine hits the end of the string, there is nothing more for it to do, and it returns the three matches it picked up.
So this pattern still isn't perfect, because it will match parts of a string that contain our disallowed phrases. In order to prevent this, we need to introduce anchors into our pattern: ^(?:(?!cat|dog|fish).)*$
Anchors are also zero-width assertions, that assert that the position the engine is in must be a specific location in the string. In our case, ^
matches the beginning of the string, and $
matches the end of the string. Now when we match our pattern against catdogfish
, none of those small matches can be picked up anymore, because none of them match the anchor positions.
So the final expression would be ^(?:(?!cat|dog|fish).)*$
.
1 However, the dot doesn't match newline characters by default, unless the /s
(or "single line") modifier is enabled on the regex.
2 I'm making the assumption here that the pattern is working in "global" mode, which makes the pattern match as many times as possible. Without global mode, the pattern would only return the first match, at
.
Normally it's better to leave the negation to the code "around" regexp - such as -v switch in grep or !~ in perl. Is there a particular problem you're trying to solve, or its it just an exercise?
精彩评论