
Regex for "AND NOT" operation [duplicate]

This question already has answers here: Regex: match everything but a specific pattern (6 answers) Closed 12 months ago.

I'm looking for a general regex construct to match everything in pattern x EXCEPT matches to pattern y. This is hard to explain both completely and concisely...see Material Nonimplication for a formal definition.

For example, match any word character (\w) EXCEPT 'p'. Note I'm subtracting a small set (the letter 'p') from a larger set (all word characters). I can't just say [^p] because that doesn't take into account the larger limiting set of only word characters. For this little example, sure, I could manually reconstruct something like [a-oq-zA-OQ-Z0-9_], which is a pain but doable. But I'm looking for a more general construct so that at least the large positive set can be a more complex expression. Like match ((?<=(so|me|^))big(com?pl{1,3}ex([pA]t{2}ern))) except when it starts with "My".

Edit: I realize that was a bad example, since excluding stuff at the beginning or end is a situation where negative look-ahead and look-behind expressions work. (Bohemian, I still gave you an upvote for illustrating this.) So...what about excluding matches that contain "My" somewhere in the middle?...I'm still really looking for a general construct, like a regex equivalent of the following pseudo-SQL:

select [captures] from [input]
where (
    input MATCHES [pattern1]
    AND NOT capture MATCHES [pattern2]
)

If the answer is "it does not exist and here is why..." I'd like to know that too.

Edit 2: If I wanted to define my own function to do this, it would be something like this (here's a C# LINQ version):

public static Match[] RegexMNI(string input, 
                               string positivePattern, 
                               string negativePattern) {
    return (from Match m in Regex.Matches(input, positivePattern)
            where !Regex.IsMatch(m.Value, negativePattern)
            select m).ToArray();
}

I'm STILL just wondering if there is a native regex construct that could do this.


This will match any character that is a word character and is not a p:

((?=[^p])\w)
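As a quick check of the idea above, here is a minimal Python sketch. The sample string is made up for illustration, and the last two patterns are alternatives that do not appear in the original answer: a negative-lookahead spelling and a double-negative character class.

```python
import re

# Made-up sample input for illustration.
text = "apple pie_P9"

# The form from the answer: assert "next char is not p", then consume one word character.
lookahead_positive = re.findall(r"(?=[^p])\w", text)

# Alternative spelling with a negative lookahead (not from the original answer).
lookahead_negative = re.findall(r"(?!p)\w", text)

# Double-negative class: any character that is neither a non-word character nor 'p'.
negated_class = re.findall(r"[^\Wp]", text)

print(lookahead_positive)  # ['a', 'l', 'e', 'i', 'e', '_', 'P', '9']
```

All three drop only the lowercase p, whereas the hand-built class [a-oq-zA-OQ-Z0-9_] from the question also drops uppercase P.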

To solve your example, use a negative look-ahead for "My" anywhere in the input, i.e. (?!.*My):

^(?!.*My)((?<=(so|me|^))big(com?pl{1,3}ex([pA]t{2}ern)))

Note the start-of-input anchor ^, which is required to make it work: without it, the engine can retry the lookahead at a later position, past the "My".
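To see why the anchor matters, here is a small Python sketch. Python's re module requires fixed-width lookbehind, so the pattern from the question is swapped for a simplified, hypothetical positive pattern (big\w+). The sample inputs are also invented.

```python
import re

# Hypothetical simplified positive pattern; not the one from the question.
anchored = re.compile(r"^(?!.*My)big\w+")
unanchored = re.compile(r"(?!.*My)big\w+")

# No "My" anywhere: the anchored pattern matches from the start of the input.
print(anchored.search("bigpattern"))

# "My" is present: the lookahead is tested only at position 0, so nothing matches.
print(anchored.search("My bigpattern"))

# Without ^ the engine retries at later positions, past the "My",
# where the lookahead no longer sees it, and so it matches "bigpattern".
print(unanchored.search("My bigpattern"))
```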


I wonder why people try to do complicated things in big monolithic regular expressions?

Why can't you just break down the problem into sub-parts and then make really easy regular expressions to match those individually? In this case, first match \w, then match [^p] if that first match succeeds. Perl (and other languages) lets you construct really complicated-looking regular expressions that do exactly what you need in one big blobby-regex (or, as it may well be, in a short and snappy crypto-regex), but for the sake of whoever needs to read (and maintain!) the code once you've gone, you need to document it fully. Better then to make it easy to understand from the start.

Sorry, rant over.


After your edits, it's still the negative lookahead, but with an additional quantifier.

If you want to ensure that the whole string does not contain "My", then you can do this:

(?!.*My)^.*$

See it here on Regexr

This will match any sequence of characters (with the .* at the end), and the (?!.*My) at the beginning will fail when there is a "My" anywhere in the string.
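A minimal Python sketch of that whole-string check, assuming single-line input (by default . does not match a newline). The helper name lacks_my is made up for this example.

```python
import re

# Pattern from the answer: reject the string if "My" occurs anywhere in it.
no_my = re.compile(r"(?!.*My)^.*$")

def lacks_my(line: str) -> bool:
    """Hypothetical helper: True when the line contains no "My"."""
    return no_my.search(line) is not None

print(lacks_my("nothing to see here"))  # True
print(lacks_my("this is My string"))    # False
print(lacks_my("my lowercase"))         # True (the match is case-sensitive)
```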

If you want to match anything that is not exactly "My", then use anchors:

(?!^My$).*
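One caveat, sketched in Python: this pattern only behaves as "anything but exactly My" when the match is forced to cover the whole input. With a plain search, the engine can skip to the second character, where ^ no longer holds, and match the tail. The sample inputs are invented.

```python
import re

not_exactly_my = re.compile(r"(?!^My$).*")

# Forcing a whole-string match gives the intended behaviour.
print(not_exactly_my.fullmatch("My"))       # None
print(not_exactly_my.fullmatch("My dog"))   # matches "My dog"
print(not_exactly_my.fullmatch("Myth"))     # matches "Myth"

# A plain search restarts at position 1, where the lookahead succeeds.
print(not_exactly_my.search("My").group())  # y
```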


So after looking through these topics on RegEx's: lookahead, lookbehind, nesting, AND operator, recursion, subroutines, conditionals, anchors, and groups, I've come to the conclusion that there is no solution that satisfies what you're asking for.

Lookahead doesn't work because it fails in this relatively simple case:

Three words without My included as one.

Regex:

(?!.*My.*)(\b\w+\b\s\b\w+\b\s\b\w+\b)

Matches:

included as one

The first three words fail to match because "My" occurs after them. If "My" is at the end of the entire string, you'll never match anything, because the lookahead at every starting position will see it and fail.
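The behaviour described above can be reproduced with a short Python sketch. It runs the pattern without a leading ^ and, for comparison, a variant with one, where the lookahead is tested only at position 0. The helper name captures is made up for this example.

```python
import re

sentence = "Three words without My included as one."
three_words = r"(\b\w+\b\s\b\w+\b\s\b\w+\b)"

unanchored = re.compile(r"(?!.*My.*)" + three_words)
anchored = re.compile(r"^(?!.*My.*)" + three_words)

def captures(pattern: re.Pattern, text: str) -> list:
    """Hypothetical helper: collect group 1 of every match."""
    return [m.group(1) for m in pattern.finditer(text)]

# Only the words after "My" survive; "Three words without" is lost.
print(captures(unanchored, sentence))                  # ['included as one']

# With ^ the single attempt at position 0 sees "My" and fails outright.
print(captures(anchored, sentence))                    # []

# "My" at the very end: every starting position sees it, so nothing matches.
print(captures(unanchored, "Three words without My"))  # []
```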

The problem appears to be that while lookahead has an implicit anchor as to where it begins its match, there's no way of terminating where lookahead ends its search with an anchor based upon the result of another part of the RegEx. That means you really have to duplicate all of the RegEx into the negative lookahead to manually create the anchor you're after.
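For this particular three-word case, the duplication might look like the following Python sketch. The bounded pattern is an assumption written for this example, not something from the original answer: the lookahead repeats the word-space-word shape so that it can only look as far as the three words the main pattern is about to consume.

```python
import re

sentence = "Three words without My included as one."

# The lookahead mirrors the main pattern's shape: up to two "word + space"
# units, then part of a word, then "My". It cannot see past the third word.
bounded = re.compile(r"\b(?!(?:\w+\s){0,2}\w*My)\w+\s\w+\s\w+\b")

print([m.group() for m in bounded.finditer(sentence)])
# ['Three words without', 'included as one']
```

The cost is the one the answer points out: the lookahead has to be rewritten by hand whenever the positive pattern changes.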

This is frustrating and a pain. The "solution" appears to be to use a scripting language to apply two regexes, one on top of the other. I'm surprised this kind of functionality isn't better built into regular expression engines.
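A minimal Python sketch of that two-pass approach, mirroring the RegexMNI helper from the question. The function name regex_mni and the sample patterns are chosen for this example.

```python
import re
from typing import List

def regex_mni(text: str, positive_pattern: str, negative_pattern: str) -> List[str]:
    """Return the text of each positive match that contains no negative match."""
    return [
        m.group()
        for m in re.finditer(positive_pattern, text)
        if re.search(negative_pattern, m.group()) is None
    ]

result = regex_mni(
    "Three words without My included as one.",
    r"\b\w+\s\w+\s\w+\b",  # pass 1: any three consecutive words
    r"My",                 # pass 2: reject matches containing "My"
)
print(result)  # ['Three words without']
```

Note that the first pass consumes "My included as" before the second pass rejects it, so "included as one" is never offered as a candidate. Whether that is acceptable depends on how overlapping matches should be treated.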
