
Regex: match everything but a specific pattern

I need a regular expression able to match everything but a string starting with a specific pattern (specifically index.php and what follows, like index.php?id=2342343).


Regex: match everything but:

  • a string starting with a specific pattern (e.g. any string, including the empty one, that does not start with foo):
    • Lookahead-based solution for NFAs:
      • ^(?!foo).*$
      • ^(?!foo)
    • Negated character class based solution for regex engines not supporting lookarounds:
      • ^(([^f].{2}|.[^o].|.{2}[^o]).*|.{0,2})$
      • ^([^f].{2}|.[^o].|.{2}[^o])|^.{0,2}$
  • a string ending with a specific pattern (say, no world. at the end):
    • Lookbehind-based solution:
      • (?<!world\.)$
      • ^.*(?<!world\.)$
    • Lookahead solution:
      • ^(?!.*world\.$).*
      • ^(?!.*world\.$)
    • POSIX workaround:
      • ^(.*([^w].{5}|.[^o].{4}|.{2}[^r].{3}|.{3}[^l].{2}|.{4}[^d].|.{5}[^.])|.{0,5})$
      • ([^w].{5}|.[^o].{4}|.{2}[^r].{3}|.{3}[^l].{2}|.{4}[^d].|.{5}[^.]$|^.{0,5})$
  • a string containing specific text (say, not match a string having foo):
    • Lookaround-based solution:
      • ^(?!.*foo)
      • ^(?!.*foo).*$
    • POSIX workaround:
      • Use the online regex generator at www.formauri.es/personal/pgimeno/misc/non-match-regex
  • a string containing specific character (say, avoid matching a string having a | symbol):
    • ^[^|]*$
  • a string equal to some string (say, not equal to foo):
    • Lookaround-based:
      • ^(?!foo$)
      • ^(?!foo$).*$
    • POSIX:
      • ^(.{0,2}|.{4,}|[^f]..|.[^o].|..[^o])$
  • a sequence of characters:
    • PCRE (match any text but cat): /cat(*SKIP)(*FAIL)|[^c]*(?:c(?!at)[^c]*)*/i or /cat(*SKIP)(*FAIL)|(?:(?!cat).)+/is
    • Other engines allowing lookarounds: (cat)|[^c]*(?:c(?!at)[^c]*)* (or (?s)(cat)|(?:(?!cat).)*, or (cat)|[^c]+(?:c(?!at)[^c]*)*|(?:c(?!at)[^c]*)+[^c]*), then check the result in code: if Group 1 matched, the match is not what we need; otherwise, take the match value if it is not empty
  • a certain single character or a set of characters:
    • Use a negated character class: [^a-z]+ (any char other than a lowercase ASCII letter)
    • Matching any char(s) but |: [^|]+
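The "check Group 1 in code" step for the cat pattern in the list above might look roughly like this in Python. This is only a sketch: the function name pieces_without_cat is made up for illustration, and it assumes Python 3.7+ behavior for empty matches in finditer.

```python
import re

# Pattern from the list above: group 1 captures the unwanted "cat",
# the second alternative consumes a stretch of text containing no "cat".
PATTERN = re.compile(r"(cat)|[^c]*(?:c(?!at)[^c]*)*")

def pieces_without_cat(text):
    """Return the non-empty chunks of text lying between occurrences of 'cat'."""
    chunks = []
    for m in PATTERN.finditer(text):
        if m.group(1) is not None:
            continue  # Group 1 matched: this is "cat" itself, skip it
        if m.group(0):  # ignore empty matches
            chunks.append(m.group(0))
    return chunks

print(pieces_without_cat("a cat and a cow"))
```

For "concatenate" this should give ['con', 'enate'], since the c in "con" is not followed by "at" and is therefore kept.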

Demo note: in the demos, the newline \n is added to negated character classes so that a match does not overflow onto neighboring lines. It is not needed when testing individual strings.

Anchor note: In many languages, use \A to define the unambiguous start of string, and \z (in Python, it is \Z, in JavaScript, $ is OK) to define the very end of the string.
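To see why the anchor choice matters, here is a small Python sketch comparing the "not equal to foo" pattern from the list with a variant that uses \A and \Z. The names LOOSE, STRICT and is_not_foo are just illustrative.

```python
import re

LOOSE = re.compile(r"^(?!foo$)")      # pattern from the list above
STRICT = re.compile(r"\A(?!foo\Z)")   # same idea with unambiguous anchors

def is_not_foo(pattern, s):
    """True if the pattern accepts s, i.e. s is considered 'not equal to foo'."""
    return pattern.match(s) is not None

# In Python, "$" also matches right before a trailing newline,
# so LOOSE treats "foo\n" the same as "foo"; STRICT does not.
for s in ["foo", "foo\n", "food"]:
    print(repr(s), is_not_foo(LOOSE, s), is_not_foo(STRICT, s))
```

Both patterns reject "foo" and accept "food"; they only differ on "foo\n", which LOOSE rejects and STRICT accepts.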

Dot note: In many flavors (but not POSIX, TRE, TCL), . matches any char but a newline char. Make sure you use a corresponding DOTALL modifier (/s in PCRE/Boost/.NET/Python/Java and /m in Ruby) for the . to match any char including a newline.
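As a rough illustration in Python (where the DOTALL modifier is re.DOTALL or an inline (?s)), the "does not contain foo" lookahead can give the wrong answer on multi-line input without the flag. The helper name lacks_foo is hypothetical.

```python
import re

CONTAINS_NO_FOO = r"^(?!.*foo)"   # pattern from the list above

def lacks_foo(s, flags=0):
    """True if the pattern claims that s does not contain 'foo'."""
    return re.match(CONTAINS_NO_FOO, s, flags) is not None

text = "bar\nfoo"  # "foo" sits on the second line

# Without DOTALL, ".*" stops at the newline and never sees "foo".
print(lacks_foo(text))
# With DOTALL, "." crosses the newline and the lookahead finds "foo".
print(lacks_foo(text, re.DOTALL))
```

The first call reports True (a false positive), the second reports False.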

Backslash note: In languages where you have to declare patterns with C strings allowing escape sequences (like \n for a newline), you need to double the backslashes escaping special characters so that the engine could treat them as literal characters (e.g. in Java, world\. will be declared as "world\\.", or use a character class: "world[.]"). Use raw string literals (Python r'\bworld\b'), C# verbatim string literals @"world\.", or slashy strings/regex literal notations like /world\./.
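In Python, for instance, the three ways of writing the literal dot mentioned above could be compared like this (a sketch, the variable names are arbitrary):

```python
import re

escaped = "world\\."      # ordinary string literal: backslash doubled
raw = r"world\."          # raw string literal: backslash written once
char_class = "world[.]"   # character class: no backslash needed

# The first two are the very same pattern string.
print(escaped == raw)

# All three match a literal dot only, not "any character".
for pattern in (escaped, raw, char_class):
    print(pattern,
          re.search(pattern, "hello world.") is not None,
          re.search(pattern, "hello worldX") is not None)
```

Each pattern should match "hello world." and fail on "hello worldX".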


You could use a negative lookahead from the start, e.g., ^(?!foo).*$ shouldn't match anything starting with foo.
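Applied to the index.php case from the question, that lookahead could be used along these lines in Python. This is a minimal sketch; NOT_INDEX_PHP and is_allowed are names invented for the example.

```python
import re

# Same shape as ^(?!foo).*$, with "foo" replaced by an escaped "index.php".
NOT_INDEX_PHP = re.compile(r"^(?!index\.php).*$")

def is_allowed(path):
    """True if path does not start with 'index.php'."""
    return NOT_INDEX_PHP.match(path) is not None

for path in ["index.php?id=2342343", "index.php", "about.php", "my/index.php", ""]:
    print(repr(path), is_allowed(path))
```

Only the first two paths are rejected; "my/index.php" passes because the lookahead is anchored at the start of the string.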


You can put a ^ in the beginning of a character set to match anything but those characters.

[^=]*

will match everything but =
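For example, in Python (a quick sketch with made-up input strings):

```python
import re

# [^=]* grabs everything up to, but not including, the first "=".
key = re.match(r"[^=]*", "key=value").group(0)
print(key)

# [^=]+ with findall returns every non-empty run of non-"=" characters.
parts = re.findall(r"[^=]+", "a=b=c")
print(parts)
```

This prints key and then ['a', 'b', 'c'].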


Just match /^index\.php/, and then reject whatever matches it.
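In Python, the "match, then reject" approach might be sketched like this; keep_non_index is a hypothetical helper name, and for a fixed prefix like this one str.startswith would do the same job without a regex.

```python
import re

INDEX_PHP = re.compile(r"^index\.php")

def keep_non_index(paths):
    """Return only the paths that do NOT start with 'index.php'."""
    return [p for p in paths if not INDEX_PHP.match(p)]

paths = ["index.php?id=2342343", "about.php", "index.html", "index.php"]
print(keep_non_index(paths))
```

This keeps about.php and index.html and drops both index.php entries.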


In Python:

>>> import re
>>> p=r'^(?!index\.php\?[0-9]+).*$'
>>> s1='index.php?12345'
>>> re.match(p,s1)
>>> s2='index.html?12345'
>>> re.match(p,s2)
<_sre.SRE_Match object at 0xb7d65fa8>