
Python's Regular Expression Source String Length

In Python Regular Expressions,

re.compile("x"*50000)

raises OverflowError: regular expression code size limit exceeded

but the following one does not raise any error, although it pushes the CPU to 100% and took about 1 minute on my PC:

>>> re.compile(".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000)
<_sre.SRE_Pattern object at 0x03FB0020>

Is that normal?

Should I assume that ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 is somehow shorter than "x"*50000?

Tested on Python 2.6, Win32

UPDATE 1:

It looks like ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 could be reduced to .*?

So, how about this one?

re.compile(".*?x"*50000)

It does compile, and if that one could also be reduced to ".*?x", it should match the string "abcx" or "x" alone, but it does not match.


So, am I missing something?

UPDATE 2:

My point is not to find the maximum length of a regex source string. I would like to understand why "x"*50000 is caught by the overflow check while ".*?x"*50000 is not.

It does not make sense to me, which is why I am asking.

Is something missing from the overflow check, is this behaviour fine, or is something really overflowing?

Any hints or opinions will be appreciated.


The difference is that ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 can be reduced to ".*?", while "x"*50000 has to generate 50000 nodes in the FSM (or a similar structure used by the regex engine).

EDIT: OK, I was wrong. It's not that smart. The reason why "x"*50000 fails but ".*?x"*50000 doesn't is that there is a limit on the size of one "code item". "x"*50000 generates one long item, while ".*?x"*50000 generates many small items. If you could split the string literal somehow without changing the meaning of the regex, it would work, but I can't think of a way to do that.
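
To check on a small scale that the repeated pattern is not collapsed, here is a minimal sketch that uses three copies of ".*?x" instead of 50000. The helper name compile_outcome is made up for this illustration, and the exact exception raised for an oversized pattern depends on the Python version, so the helper only reports whichever one it sees.

```python
import re

def compile_outcome(pattern):
    """Return 'ok' if the pattern compiles, else the exception class name.

    Hypothetical helper for illustration; which exception (if any) an
    oversized pattern raises differs between Python versions.
    """
    try:
        re.compile(pattern)
        return "ok"
    except (OverflowError, RuntimeError, re.error) as exc:
        return type(exc).__name__

# Small-scale stand-in for ".*?x" * 50000: three copies need three "x" characters.
three = re.compile(".*?x" * 3)
one = re.compile(".*?x")

# The repeated pattern does not behave like a single ".*?x".
needs_three_x = three.match("abcx") is None and three.match("axbxcx") is not None
single_matches = one.match("abcx") is not None

print(needs_three_x)
print(single_matches)
print(compile_outcome("x" * 50))
```

By the same logic, ".*?x"*50000 can only match a string that contains at least 50000 "x" characters, which is consistent with "abcx" not matching.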


You want to match 50000 "x"s, correct? If so, here is an alternative without a regex:

if "x"*50000 in mystring:
    print "found"

If you want to match 50000 "x"s using a regex, you can use a repetition count:

>>> pat=re.compile("x{50000}")
>>> pat.search(s)
<_sre.SRE_Match object at 0xb8057a30>

On my system, the repetition count can be at most 65535:

>>> pat=re.compile("x{65536}")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/re.py", line 188, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.6/re.py", line 241, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python2.6/sre_compile.py", line 529, in compile
    groupindex, indexgroup
RuntimeError: invalid SRE code
>>> pat=re.compile("x{65535}")
>>>

I don't know if there are tweaks in Python we can use to increase that limit though.
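
One possible way around a per-count limit, offered as an assumption rather than something verified on Python 2.6, is to nest two quantifiers so that each individual count stays small while their product gives the total. The split into 25000 and 2 below is an arbitrary choice for illustration.

```python
import re

# Hypothetical workaround: 2 * 25000 = 50000, and each count is well below 65535.
nested = re.compile("(?:x{25000}){2}")

exact = "x" * 50000
short = "x" * 49999

# match() anchors at the start of the string, so this checks for a
# run of at least 50000 "x" characters beginning at position 0.
matches_exact = nested.match(exact) is not None
matches_short = nested.match(short) is not None

print(matches_exact)
print(matches_short)
```

Whether the engine accepts this on a given build would still need to be tested there, since the nested form is compiled differently from a single "x{50000}".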
