开发者

A Regex that will never be matched by anything

What's your thought - what does a Regex look like, that will never be matched by any string, ever!

Edit: Why I want this? Well, firstly because I find it interesting to think of such an expression and secondly because I need it for a script.

In that script I define a dictionary as Dictionary<string, Regex>. This contains, as you see, a string and an expression.

Based on that dictionary I create methods that all use this dictionary as only reference on how they should do their work, one of them matches the regexes against a parsed logfile.

If an expression is matched, another Dictionary<string, long> is added a value that is returned by the expression. So, to catch any log-messages that are not matched by an expression in the dictionary I created a new group called "unknown".

To this group everything that didn't match anything other is added. But to prevent the "unknown"-expression to mismatch (by accident) a log-message, I had to create an expression that is most certainly never matched, no matter what string I give it.开发者_Python百科


Leverage negative lookahead:

>>> import re
>>> x=r'(?!x)x'
>>> r=re.compile(x)
>>> r.match('')
>>> r.match('x')
>>> r.match('y')

this RE is a contradiction in terms and therefore will never match anything.

NOTE:
In Python, re.match() implicitly adds a beginning-of-string anchor (\A) to the start of the regular expression. This anchor is important for performance: without it, the entire string will be scanned. Those not using Python will want to add the anchor explicitly:

\A(?!x)x


This is actually quite simple, although it depends on the implementation / flags*:

$a

Will match a character a after the end of the string. Good luck.

WARNING:
This expression is expensive -- it will scan the entire line, find the end-of-line anchor, and only then not find the a and return a negative match. (See comment below for more detail.)


* Originally I did not give much thought on multiline-mode regexp, where $ also matches the end of a line. In fact, it would match the empty string right before the newline, so an ordinary character like a can never appear after $.


One that was missed:

^\b$

It can't match because the empty string doesn't contain a word boundary. Tested in Python 2.5.


look around:

(?=a)b

For regex newbies: The positive look ahead (?=a) makes sure that the next character is a, but doesn't change the search location (or include the 'a' in the matched string). Now that next character is confirmed to be a, the remaining part of the regex (b) matches only if the next character is b. Thus, this regex matches only if a character is both a and b at the same time.


a\bc, where \b is a zero-width expression that matches word boundary.

It can't appear in the middle of a word, which we force it to.


$.

.^

$.^

(?!)


Maximal matching

a++a

At least one a followed by any number of a's, without backtracking. Then try to match one more a.

or Independent sub expression

This is equivalent to putting a+ in an independent sub expression, followed by another a.

(?>a+)a


\B\b

\b matches word boundaries - the position between a letter an a non-letter (or the string boundary).
\B is its complement - it matches the position between two letters or between non-letters.

Together they cannot match any position.

See also:

  • Word Boundaries
  • This pattern not matching a few positions
  • Inspiration


Perl 5.10 supports special control words called "verbs", which is enclosed in (*...) sequence. (Compare with (?...) special sequence.) Among them, it includes (*FAIL) verb which returns from the regular expression immediately.

Note that verbs are also implemented in PCRE shortly after, so you can use them in PHP or other languages using PCRE library too. (You cannot in Python or Ruby, however. They use their own engine.)


How about $^ or maybe (?!)


This seems to work:

$.


The fastest will be:

r = re.compile(r'a^')
r.match('whatever')

'a' can be any non-special character ('x','y'). Knio's implementation might be a bit more pure but this one will be faster for all strings not starting with whatever character you choose instead of 'a' because it will not match after the first character rather than after the second in those cases.


[^\d\D] or (?=a)b or a$a or a^a


This won't work for Python, and many other languages, but in a Javascript regex, [] is a valid character class that can't be matched. So the following should fail immediately, no matter what the input:

var noMatch = /^[]/;

I like it better than /$a/ because to me, it clearly communicates its intent. And as for when you would ever need it, I needed it because I needed a fallback for a dynamically compiled pattern based on user input. When the pattern is invalid, I need to replace it with a pattern that matches nothing. Simplified, it looks like this:

try {
    var matchPattern = new RegExp(someUserInput);
}
catch (e) {
    matchPattern = noMatch;
}


Python won't accept it, but Perl will:

perl -ne 'print if /(w\1w)/'

This regex should (theoretically) try to match an infinite (even) number of ws, because the first group (the ()s) recurses into itself. Perl doesn't seem to be issuing any warnings, even under use strict; use warnings;, so I assume it's at least valid, and my (minimal) testing fails to match anything, so I submit it for your critique.


So many good answers!

Similar to @nivk's answer, I would like to share performance comparison for Perl for different variants of never-matching regex.

  1. Input: pseudo-random ascii strings (25,000 different lines, length 8-16):

Regex speed:

Total for   \A(?!x)x: 69.675450 s, 1435225 lines/s
Total for       a\bc: 71.164469 s, 1405195 lines/s
Total for    (?>a+)a: 71.218324 s, 1404133 lines/s
Total for       a++a: 71.331362 s, 1401907 lines/s
Total for         $a: 72.567302 s, 1378031 lines/s
Total for     (?=a)b: 72.842308 s, 1372828 lines/s
Total for     (?!x)x: 72.948911 s, 1370822 lines/s
Total for       ^\b$: 79.417197 s, 1259173 lines/s
Total for         $.: 88.727839 s, 1127041 lines/s
Total for       (?!): 111.272815 s, 898692 lines/s
Total for         .^: 115.298849 s, 867311 lines/s
Total for    (*FAIL): 350.409864 s, 285380 lines/s
  1. Input: /usr/share/dict/words (100,000 English words).

Regex speed:

Total for   \A(?!x)x: 128.336729 s, 1564805 lines/s
Total for     (?!x)x: 132.138544 s, 1519783 lines/s
Total for       a++a: 133.144501 s, 1508301 lines/s
Total for    (?>a+)a: 133.394062 s, 1505479 lines/s
Total for       a\bc: 134.643127 s, 1491513 lines/s
Total for     (?=a)b: 137.877110 s, 1456528 lines/s
Total for         $a: 152.215523 s, 1319326 lines/s
Total for       ^\b$: 153.727954 s, 1306346 lines/s
Total for         $.: 170.780654 s, 1175906 lines/s
Total for       (?!): 209.800379 s, 957205 lines/s
Total for         .^: 217.943800 s, 921439 lines/s
Total for    (*FAIL): 661.598302 s, 303540 lines/s

(Ubuntu on Intel i5-3320M, Linux kernel 4.13, Perl 5.26)


Empty regex

The best regex to never match anything is an empty regex. But I'm not sure all regex engine will accept that.

Impossible regex

The other solution is to create an impossible regex. I found that $-^ only takes two steps to compute regardless of the size of your text (https://regex101.com/r/yjcs1Z/1).

For reference:

  • $^ and $. take 36 steps to compute -> O(1)
  • \b\B takes 1507 steps on my sample and increase with the number of character in your string -> O(n)

More popular thread about this question:

  • Regex to *not* match any characters


I believe that

\Z RE FAILS! \A

covers even the cases where the regular expression includes flags like MULTILINE, DOTALL etc.

>>> import re
>>> x=re.compile(r"\Z RE FAILS! \A")
>>> x.match('')
>>> x.match(' RE FAILS! ')
>>>

I believe (but I haven't benchmarked it) that whatever the length (> 0) of the string between \Z and \A, the time-to-failure should be constant.


(*FAIL)

or

(*F)

With PCRE and PERL you can use this backtracking control verb that forces the pattern to fail immediatly.


After seeing some of these great answers, @arantius's comment (regarding timing $x vs x^ vs (?!x)x) on the currently accepted answer made me want to time some of the solutions given so far.

Using @arantius's 275k line standard, I ran the following tests in Python (v3.5.2, IPython 6.2.1).

TL;DR: 'x^' and 'x\by' are the fastest by a factor of at least ~16, and contrary to @arantius's finding, (?!x)x was among the slowest (~37 times slower). So the speed question is certainly implementation dependent. Test it yourself on your intended system before committing if speed is important to you.

UPDATE: There is apparently a large discrepancy between timing 'x^' and 'a^'. Please see this question for more info, and the previous edit for the slower timings with a instead of x.

In [1]: import re

In [2]: with open('/tmp/longfile.txt') as f:
   ...:     longfile = f.read()
   ...:     

In [3]: len(re.findall('\n',longfile))
Out[3]: 275000

In [4]: len(longfile)
Out[4]: 24733175

In [5]: for regex in ('x^','.^','$x','$.','$x^','$.^','$^','(?!x)x','(?!)','(?=x)y','(?=x)(?!x)',r'x\by',r'x\bx',r'^\b$'
    ...: ,r'\B\b',r'\ZNEVERMATCH\A',r'\Z\A'):
    ...:     print('-'*72)
    ...:     print(regex)
    ...:     %timeit re.search(regex,longfile)
    ...:     
------------------------------------------------------------------------
x^
6.98 ms ± 58.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
------------------------------------------------------------------------
.^
155 ms ± 960 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
$x
111 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
$.
111 ms ± 1.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
$x^
112 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
$.^
113 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
$^
111 ms ± 839 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
(?!x)x
257 ms ± 5.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
------------------------------------------------------------------------
(?!)
203 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
(?=x)y
204 ms ± 4.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
------------------------------------------------------------------------
(?=x)(?!x)
210 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
------------------------------------------------------------------------
x\by
7.41 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
------------------------------------------------------------------------
x\bx
7.42 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
------------------------------------------------------------------------
^\b$
108 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
\B\b
387 ms ± 5.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
------------------------------------------------------------------------
\ZNEVERMATCH\A
112 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
------------------------------------------------------------------------
\Z\A
112 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The first time I ran this, I forgot to raw the last 3 expressions, so '\b' was interpreted as '\x08', the backspace character. However, to my surprise, 'a\x08c' was faster than the previous fastest result! To be fair, it will still match that text, but I thought it was still worth noting because I'm not sure why it's faster.

In [6]: for regex in ('x\by','x\bx','^\b$','\B\b'):
    ...:     print('-'*72)
    ...:     print(regex, repr(regex))
    ...:     %timeit re.search(regex,longfile)
    ...:     print(re.search(regex,longfile))
    ...:     
------------------------------------------------------------------------
y 'x\x08y'
5.32 ms ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
None
------------------------------------------------------------------------
x 'x\x08x'
5.34 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
None
------------------------------------------------------------------------
$ '^\x08$'
122 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
None
------------------------------------------------------------------------
\ '\\B\x08'
300 ms ± 4.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
None

My test file was created using a formula for " ...Readable Contents And No Duplicate Lines" (on Ubuntu 16.04):

$ ruby -e 'a=STDIN.readlines;275000.times do;b=[];rand(20).times do; b << a[rand(a.size)].chomp end; puts b.join(" "); end' < /usr/share/dict/words > /tmp/longfile.txt

$ head -n5 /tmp/longfile.txt 
unavailable speedometer's garbling Zambia subcontracted fullbacks Belmont mantra's
pizzicatos carotids bitch Hernandez renovate leopard Knuth coarsen
Ramada flu occupies drippings peaces siroccos Bartók upside twiggier configurable perpetuates tapering pint paralyzed
vibraphone stoppered weirdest dispute clergy's getup perusal fork
nighties resurgence chafe


All the examples involving a boundary matcher follows the same recipe. Recipe:

  1. Take any of the boundary matchers: ^,$,\b,\A,\Z,\z

  2. Do opposite to what they are meant for

Examples:

^ and \A are meant for the beginning so don't use them in beginning

^ --> .^
\A --> .\A

\b matches a word boundary so use it in between

\b --> .\b.

$, \Z and \z are meant for the end so don't use them in the end

$ --> $.
\Z --> \Z.
\z --> \z.

Others involve use of lookahead and lookbehind which also work with the same analogy: If you give positive or negative lookahead followed by something opposite

(?=x)[^x]
(?!x)x

If you give positive or negative lookbehind following something opposite

[^x](?<=x)
x(?<!x)

Their could be more such pattern and more such analogies.


As professionals mentioned, it depend to Regular Expression Engines and of course a performance benchmark depend to many things including device.

But as a reference about Performance for ECMAScript (Java/Javascript) or PCRE (PHP) the best from top to down is:

  1. [] | ^[] (Fastest) [Just ECMAScript]
  2. $^ (non-multi-line flag) (Fast)
  3. [^\S\s] | ^[^\S\s] | ^[^\W\w] | ^[^\D\d] (Fast)
  4. .^ (non-multi-line flag) (Fast)
  5. (?!\x00)\x00 | ^(?!\x00)\x00 | (?!\0)\0 (Fast)
  6. (?!a)a
  7. (?!) (Slow)
  8. (?=b)a (Slow)
  9. Other examples like \b\B etc... (Slowest)

A live try for Javascript (Not so accurate)

_Note: ^ = \A (PCRE) = at Start (non-multi-line) more info


Maybe this?

/$.+^/


'[^0-9a-zA-Z...]*'

and replace ... with all printable symbols ;). That's for a text file.


What about instead of regex, just use an always false if statement? In javascript:

var willAlwaysFalse=false;
if(willAlwaysFalse)
{
}
else
{
}


\A[^\w\W]

Works regardless of regex flags.

According to regex101: For empty input string, 0 steps. For all other input strings exactly 2 steps.

Kotlin playground: https://pl.kotl.in/hdbNH73It


^_^, which never matches and fails quickly.


A portable solution that will not depend on the regexp implementation is to just use a constant string that you are sure will never appear in the log messages. For instance make a string based on the following:

cat /dev/urandom | hexdump | head -20
0000000 5d5d 3607 40d8 d7ab ce72 aae1 4eb3 ae47
0000010 c5e2 b9e8 910d a2d9 2eb3 fdff 6301 c85f
0000020 35d4 c282 e439 33d8 1c73 ca78 1e4d a569
0000030 8aca eb3c cbe4 aff7 d079 ca38 8831 15a5
0000040 818b 323f 0b02 caec f17f 387b 3995 88da
0000050 7b02 c80b 2d42 8087 9758 f56f b71f 0053
0000060 1501 35c9 0965 2c6e 03fe 7c6d f0ca e547
0000070 aba0 d5b6 c1d9 9bb2 fcd1 5ec7 ee9d 9963
0000080 6f0a 2c91 39c2 3587 c060 faa7 4ea4 1efd
0000090 6738 1a4c 3037 ed28 f62f 20fa 3d57 3cc0
00000a0 34f0 4bc2 3067 a1f7 9a87 086b 2876 1072
00000b0 d9e1 6b8f 5432 a60e f0f5 00b5 d9ef ed6f
00000c0 4a85 70ee 5ec4 a378 7786 927f f126 2ec2
00000d0 18c5 46fe b167 1ae6 c87c 1497 48c9 3c09
00000e0 8d09 e945 13ce 7da2 08af 1a96 c24c c022
00000f0 b051 98b3 2bf5 4d7d 5ec4 e016 a50d 355b
0000100 0e89 d9dd b153 9f0e 9a42 a51f 2d46 2435
0000110 ef35 17c2 d2aa 3cc7 e2c3 e711 d229 f108
0000120 324e 5d6a 650a d151 bc55 963f 41d3 66ee
0000130 1d8c 1fb1 1137 29b2 abf7 3af7 51fe 3cf4

Sure, this is not an intellectual challenge, but more like duct tape programming.


new Regex(Guid.NewGuid().ToString())

Creates a pattern containing only alphanumerics and '-' (none of which are regex special characters) but it is statistically impossible for the same string to have appeared anywhere before (because that's the whole point of a GUID.)

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜