100% CPU usage with a regexp depending on input length

2023-03-10 10:19 Q&A author:

I'm trying to come up with a regexp in Python that matches any string as long as it does not contain three or more consecutive commas or semicolons. In other words, at most two consecutive commas or semicolons are allowed.

So this is what I currently have:

^(,|;){,2}([^,;]+(,|;){,2})*$

And it seems to work as expected:

>>> r.match('')
<_sre.SRE_Match object at 0x7f23af8407e8>
>>> r.match('foo,')
<_sre.SRE_Match object at 0x7f23af840750>
>>> r.match('foo, a')
<_sre.SRE_Match object at 0x7f23af8407e8>
>>> r.match('foo, ,')
<_sre.SRE_Match object at 0x7f23af840750>
>>> r.match('foo, ,,a')
<_sre.SRE_Match object at 0x7f23af8407e8>
>>> r.match('foo, ,,,')
>>> r.match('foo, ,,,;')
>>> r.match('foo, ,, ;;')
<_sre.SRE_Match object at 0x7f23af840750>

But as I start to increase the length of the input text, the regexp seems to need way more time to give a response.

>>> r.match('foo, bar, baz,, foo')
<_sre.SRE_Match object at 0x7f23af8407e8>
>>> r.match('foo, bar, baz,, fooooo, baaaaar')
<_sre.SRE_Match object at 0x7f23af840750>
>>> r.match('foo, bar, baz,, fooooo, baaaaar,')
<_sre.SRE_Match object at 0x7f23af8407e8>
>>> r.match('foo, bar, baz,, fooooo, baaaaar,,')
<_sre.SRE_Match object at 0x7f23af840750>
>>> r.match('foo, bar, baz,, fooooo, baaaaar,,,')
>>> r.match('foo, bar, baz,, fooooo, baaaaar,,,,')
>>> r.match('foo, bar, baz,, fooooo, baaaaar, baaaaaaz,,,,')

And finally it gets completely stuck at this stage and the CPU usage goes up to 100%.

I'm not sure whether the regexp can be optimized or whether something else is involved. Any help is appreciated.

You're running into catastrophic backtracking.

The reason is that you have made the separators optional, so the [^,;]+ part of your regex (which is itself inside a repeating group) will try a huge number of ways to split baaaaaaz before finally having to admit failure when confronted with more than two commas.
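One way to remove that ambiguity, sketched here as an assumption rather than as part of the original pattern, is to require at least one separator after each repeated run of ordinary characters. A word can then be consumed in only one way. The name SAFE is made up for illustration.

```python
import re

# Hypothetical rewrite: each repeated chunk must END with one or two
# separators, so two [^,;]+ runs can never sit next to each other and
# the engine has only one way to consume a word.
SAFE = re.compile(r'^[,;]{0,2}(?:[^,;]+[,;]{1,2})*[^,;]*$')

for text in ['',
             'foo, ,,a',
             'foo, ,,,',
             'foo, bar, baz,, fooooo, baaaaar, baaaaaaz,,,,']:
    # The last string is the one that hung the original pattern.
    print(repr(text), bool(SAFE.match(text)))
```

It should accept the same strings as the original pattern (separator runs of at most two), but it gives up quickly on the long failing input.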

RegexBuddy aborts the match attempt after 1,000,000 steps of the regex engine with your last test string. Python will keep trying.

Imagine the string baaz,,,:

Trying your regex, the regex engine has to check all these:

baaz,,<failure>
baa + z,,<failure>
ba + az,,<failure>
ba + a + z,,<failure>
b + aaz,,<failure>
b + aa + z,,<failure>
b + a + az,,<failure>
b + a + a + z,,<failure>

before declaring overall failure. See how this grows exponentially with each additional character?
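The growth can be made concrete with a small sketch that enumerates every way a run of characters can be cut into consecutive non-empty pieces, which is what a [^,;]+ nested inside a repeated group lets the engine try. The helper name splits is made up for illustration. A run of n characters has 2**(n-1) such cuts.

```python
def splits(run):
    """Yield every way to cut `run` into consecutive non-empty pieces."""
    if not run:
        yield []
        return
    for i in range(1, len(run) + 1):
        # Take the first i characters as one piece, then cut the rest.
        for rest in splits(run[i:]):
            yield [run[:i]] + rest

for pieces in splits("baaz"):
    print(" + ".join(pieces))

print(len(list(splits("baaz"))))      # 8
print(len(list(splits("baaaaaaz"))))  # 128
```

Each extra character doubles the count, and the engine pays that cost again for every word it can backtrack into.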

Behavior like this can be avoided with possessive quantifiers or atomic groups, both of which Python's re module only supports from version 3.11 onward. But you can do an inverse check easily:

if ",,," in mystring or ";;;" in mystring:
    fail()

without needing a regex at all. If mixed runs such as ,;, can also occur and should be rejected, then use Andrew's solution.
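Wrapped up as a function, the substring check might look like the sketch below. The name has_no_triple is an assumption for illustration.

```python
def has_no_triple(text):
    """Return True unless text contains ',,,' or ';;;'."""
    return ",,," not in text and ";;;" not in text

print(has_no_triple("foo, ,,a"))   # True
print(has_no_triple("foo, ,,,;"))  # False
print(has_no_triple("foo,;,bar"))  # True: mixed runs are not caught
print(has_no_triple("foo, bar, baz,, fooooo, baaaaar, baaaaaaz,,,,"))  # False
```

The substring test scans the string without any backtracking, so the long input that stalled the regex is rejected immediately.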

I think the following should do what you want:

^(?!.*[,;]{3})

This will fail if the string contains three or more , or ; in a row (in any mix). If you actually want it to match a character, add a . at the end.

This utilizes negative lookahead, which will cause the entire match to fail if the regex .*[,;]{3} would match.
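A minimal sketch of how the lookahead could be used in Python. The name NO_TRIPLE and the re.DOTALL flag are my assumptions: the flag lets . cross newlines, so a run after a line break is also found.

```python
import re

# The pattern consumes nothing: a successful match is an empty match at
# position 0, which only means "no run of three separators lies ahead".
NO_TRIPLE = re.compile(r'^(?!.*[,;]{3})', re.DOTALL)

print(bool(NO_TRIPLE.match('foo, ,,a')))   # True
print(bool(NO_TRIPLE.match('foo, ,,,')))   # False
print(bool(NO_TRIPLE.match('foo,;,bar')))  # False: mixed runs count too
print(bool(NO_TRIPLE.match('foo, bar, baz,, fooooo, baaaaar, baaaaaaz,,,,')))  # False
```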

Try this regular expression:

^([^,;]|,(?!,)|,,(?!,)|;(?!;)|;;(?!;))*$

It matches repetitively:

one single character that is neither , nor ;, or
a , that is either not followed by another , or a ,, that is not followed by another ,, or
a ; that is either not followed by another ; or a ;; that is not followed by another ;

until the end is reached. It is very efficient as it fails early without doing much backtracking.
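As a sketch for trying this approach in Python, the pattern below spells the three cases compactly with negative lookaheads: a comma that is not followed by two more commas, and likewise for semicolons. This spelling and the name STEPWISE are illustrative assumptions.

```python
import re

# Every character is consumed by exactly one alternative, so there is
# nothing for the engine to re-split when a later check fails.
STEPWISE = re.compile(r'^(?:[^,;]|,(?!,,)|;(?!;;))*$')

print(bool(STEPWISE.match('foo,,')))      # True
print(bool(STEPWISE.match('foo, ,, ;;'))) # True
print(bool(STEPWISE.match('foo, ,,,')))   # False
print(bool(STEPWISE.match('a;;;b')))      # False
print(bool(STEPWISE.match('foo,;,bar')))  # True: commas and semicolons are counted separately
```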

How about this idea: match the strings that contain the pattern you don't want, e.g. ".*,,,", and in Python just keep those that do not match. It should be fast.
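A rough sketch of that filtering idea, assuming re.search is used so the unwanted run can appear anywhere in the string. The names BAD and candidates, and the sample strings, are made up for illustration.

```python
import re

# Pattern for what we do NOT want: three commas or three semicolons in a row.
BAD = re.compile(r',,,|;;;')

candidates = ['foo, bar', 'foo,,bar', 'foo,,,bar', 'a;;;b']

# Keep only the strings in which the unwanted pattern is not found.
kept = [s for s in candidates if not BAD.search(s)]
print(kept)  # ['foo, bar', 'foo,,bar']
```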
