开发者

Regex with exception of particular words

I have problem with regex. I need to make regex with an exception of a set of specified words, for example: apple, orange, juice. and given these words, it will match everything except those words above.

applejuice (match)
yummyjuice (match)
yummy-apple-juice (match)
orangeapplejuice (match)
orange-apple-juice (match)
apple-orange-aple (match)
juice-juice-juice (match)
orange-juice (match)

apple (s开发者_StackOverflowhould not match)
orange (should not match)
juice (should not match)


If you really want to do this with a single regular expression, you might find lookaround helpful (especially negative lookahead in this example). Regex written for Ruby (some implementations have different syntax for lookarounds):

rx = /^(?!apple$|orange$|juice$)/


I noticed that apple-juice should match according to your parameters, but what about apple juice? I'm assuming that if you are validating apple juice you still want it to fail.

So - lets build a set of characters that count as a "boundary":

/[^-a-z0-9A-Z_]/        // Will match any character that is <NOT> - _ or 
                        // between a-z 0-9 A-Z 

/(?:^|[^-a-z0-9A-Z_])/  // Matches the beginning of the string, or one of those 
                        // non-word characters.

/(?:[^-a-z0-9A-Z_]|$)/  // Matches a non-word or the end of string

/(?:^|[^-a-z0-9A-Z_])(apple|orange|juice)(?:[^-a-z0-9A-Z_]|$)/ 
   // This should >match< apple/orange/juice ONLY when not preceded/followed by another
   // 'non-word' character just negate the result of the test to obtain your desired
   // result.

In most regexp flavors \b counts as a "word boundary" but the standard list of "word characters" doesn't include - so you need to create a custom one. It could match with /\b(apple|orange|juice)\b/ if you weren't trying to catch - as well...

If you are only testing 'single word' tests you can go with a much simpler:

/^(apple|orange|juice)$/ // and take the negation of this...


This gets some of the way there:

((?:apple|orange|juice)\S)|(\S(?:apple|orange|juice))|(\S(?:apple|orange|juice)\S)


\A(?!apple\Z|juice\Z|orange\Z).*\Z

will match an entire string unless it only consists of one of the forbidden words.

Alternatively, if you're not using Ruby or you're sure that your strings contain no line breaks or you have set the option that ^ and $ do not match on beginnings/ends of lines

^(?!apple$|juice$|orange$).*$

will also work.


Here's some easy copy-paste code that works for more than just exact-words exceptions.


Copy/Paste Code:

In the following regex, ONLY replace the all-caps sections with your regex.

Python regex
pattern = r"REGEX_BEFORE(?>(?P<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER"
Ruby regex
pattern = /REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(<exceptions_group_1>)always(?<=fail)|)REGEX_AFTER/
PCRE regex
REGEX_BEFORE(?>(?<exceptions_group_1>EXCEPTION_PATTERN)|YOUR_NORMAL_PATTERN)(?(exceptions_group_1)always(?<=fail)|)REGEX_AFTER
JavaScript

Impossible as of 6/17/2020, and probably won't be possible in the near future.



Full Examples

REGEX_BEFORE = \b
YOUR_NORMAL_PATTERN = \w+
REGEX_AFTER =
EXCEPTION_PATTERN = (apple|orange|juice)

Python regex
pattern = r"\b(?>(?P<exceptions_group_1>(apple|orange|juice))|\w+)(?(exceptions_group_1)always(?<=fail)|)"
Ruby regex
pattern = /\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/
PCRE regex
\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(exceptions_group_1)always(?<=fail)|)



How does it work?

This uses decently complicated regex, namely Atomic Groups, Conditionals, Lookbehinds, and Named Groups.

  1. The (?> is the start of an atomic group, which means its not allowed to backtrack: which means, If that group matches once, but then later gets invalidated because a lookbehind failed, then the whole group will fail to match. (We want this behavior in this case).

  2. The (?<exceptions_group_1> creates a named capture group. Its just easier than using numbers. Note that the pattern first tries to find the exception, and then falls back on the normal pattern if it couldn't find the exception.

  3. Note that the atomic pattern first tries to find the exception, and then falls back on the normal pattern if it couldn't find the exception.

  4. The real magic is in the (?(exceptions_group_1). This is a conditional asking whether or not exceptions_group_1 was successfully matched. If it was, then it tries to find always(?<=fail). That pattern (as it says) will always fail, because its looking for the word "always" and then it checks 'does "ways"=="fail"', which it never will.

  5. Because the conditional fails, this means the atomic group fails, and because it's atomic that means its not allowed to backtrack (to try to look for the normal pattern) because it already matched the exception.

This is definitely not how these tools were intended to be used, but it should work reliably and efficiently.



Exact answer to the original question in Ruby

/\b(?>(?<exceptions_group_1>(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/

Unlike other methods, this one can be modified to reject any pattern such as any word not containing the sub-string "apple","orange", or "juice".

/\b(?>(?<exceptions_group_1>\w*(apple|orange|juice))|\w+)(?(<exceptions_group_1>)always(?<=fail)|)/


Something like (PHP)

$input = "The orange apple gave juice";
if(preg_match("your regex for validating") && !preg_match("/apple|orange|juice/", $input))
{
  // it's ok;
}
else
{
  //throw validation error
}
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜