Is substring matching faster with a regular expression?

2023-01-08 12:13 Q&A author:

After having read up on RE/NFA and DFA, it seems that finding a substring within a string might actually be asymptotically faster using an RE rather than a brute force O(mn) find. My reasoning is that a DFA would actually maintain state and avoid processing each character in the "haystack" more than once. Hence, searches in long strings may actually be much faster if done with regular expressions.

Of course, this is valid only for RE matchers that convert from NFA to DFA.

Has anyone experienced better string match performance in real life when using RE rather than a brute force matcher?
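To make the brute force O(mn) bound in the question concrete, here is a minimal sketch of a naive matcher that also counts character comparisons. The function name `naive_find` and the comparison counter are illustrative assumptions, not part of any library.

```python
def naive_find(haystack, needle):
    """Brute-force search. Returns (index, comparisons); index is -1 if absent."""
    n, m = len(haystack), len(needle)
    comparisons = 0
    for start in range(n - m + 1):
        j = 0
        # Compare the needle against the haystack window beginning at `start`.
        while j < m:
            comparisons += 1
            if haystack[start + j] != needle[j]:
                break
            j += 1
        if j == m:
            return start, comparisons
    return -1, comparisons


if __name__ == "__main__":
    # Repetitive input: every window is scanned almost completely.
    print(naive_find("a" * 10, "aaab"))
    # Ordinary text: most windows are rejected on the first character.
    print(naive_find("the quick brown fox", "brown"))
```

On the repetitive input each of the n - m + 1 windows costs m comparisons, which is where the O(mn) worst case comes from. On ordinary text most windows fail on their first character, so the count stays close to n.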

First of all, I would recommend reading Russ Cox's article on how regular expressions are implemented in several languages: Regular Expression Matching Can Be Simple And Fast.

Because regexps in many languages are not just for matching but also support group capturing and back-references, almost all implementations use so-called "backtracking" when executing the NFA built from the given regexp. This approach has exponential time complexity in the worst case.
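A small sketch of that worst case, assuming CPython's `re` module (a backtracking engine). The pattern `(a+)+$` and the helper name `time_failed_match` are chosen here purely for illustration. The pattern can never match a run of `a` characters followed by `b`, but a backtracking matcher tries every way of splitting the run between the inner and outer `+` before giving up.

```python
import re
import time

# Nested quantifiers: many ways to split a run of "a" between the two "+".
PATHOLOGICAL = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Match against 'a' * n + 'b' (which cannot match) and time the attempt."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = PATHOLOGICAL.match(text)
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    for n in (14, 16, 18, 20):
        _, elapsed = time_failed_match(n)
        print(f"n={n:2d}: {elapsed:.4f} s")
```

The exact timings depend on the machine and Python version, so none are quoted here. The point to look for is how quickly the time grows as n increases by only a couple of characters.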

An RE implementation based on a DFA (with group capturing) is possible, but it has overhead (see Laurikari's paper NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions).

For simple substring searching you could use the Knuth-Morris-Pratt algorithm, which in effect builds a DFA for the substring and searches in optimal O(len(s)) time, where s is the string being searched (plus preprocessing linear in the pattern length). But it has overhead too: if you test the naive approach (O(nm)) against this optimal algorithm on real-world words and phrases (which are not very repetitive), you may find that the naive approach is faster on average.
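A minimal sketch of Knuth-Morris-Pratt, using the usual failure-table formulation rather than an explicit DFA transition table. The function names `build_failure_table` and `kmp_find` are illustrative.

```python
def build_failure_table(pattern):
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_find(haystack, needle):
    """Return the index of the first occurrence of needle, or -1."""
    if not needle:
        return 0
    failure = build_failure_table(needle)
    k = 0  # number of needle characters matched so far
    for i, ch in enumerate(haystack):
        # On a mismatch, fall back in the needle; the haystack index never moves back.
        while k > 0 and ch != needle[k]:
            k = failure[k - 1]
        if ch == needle[k]:
            k += 1
        if k == len(needle):
            return i - k + 1
    return -1
```

The haystack is read strictly left to right, which is the property the question attributes to a DFA. The extra table and the fallback loop are the overhead that can make this slower than the naive scan on short, non-repetitive patterns.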

For exact substring searching you could also try the Boyer–Moore algorithm, which has O(mn) worst-case complexity but performs better than KMP on average on real-world data.
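As a rough illustration, here is a sketch of the Horspool simplification of Boyer–Moore, which keeps only the bad-character shift. It is not the full algorithm with the good-suffix rule, and the name `horspool_find` is illustrative.

```python
def horspool_find(haystack, needle):
    """Boyer-Moore-Horspool search. Returns the first index of needle, or -1."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # For each character of the needle except the last, the distance from its
    # rightmost occurrence to the end of the needle. Characters that do not
    # appear there allow a shift by the full needle length.
    shift = {ch: m - 1 - i for i, ch in enumerate(needle[:-1])}
    pos = 0
    while pos <= n - m:
        # Compare the window right to left.
        j = m - 1
        while j >= 0 and haystack[pos + j] == needle[j]:
            j -= 1
        if j < 0:
            return pos
        # Shift based on the haystack character aligned with the needle's last position.
        pos += shift.get(haystack[pos + m - 1], m)
    return -1
```

When the character under the end of the window does not occur in the needle, the window jumps by the whole needle length, so on typical text many haystack characters are never inspected at all. That skipping is the reason this family of algorithms tends to do well on real-world data despite the O(mn) worst case.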

Most regular expressions used in practice are PCRE (Perl-Compatible Regular Expressions), which are more expressive than regular languages and thus cannot always be expressed with a regular grammar. PCRE has features such as positive/negative lookahead/lookbehind assertions and even recursion, so matching may require processing some characters more than once. Of course, it all comes down to the particular RE implementation: whether or not it is optimized for expressions that stay within the bounds of a regular grammar.

Personally, I haven't done any sort of performance comparisons between the two. However, in my experience I never ever had performance issues with brute force find-and-replace, while I had to deal with RE performance bottlenecks on more than one occasion.

If you look at the documentation for most languages, it will mention that if you don't need the power of regex you should use the non-regex version for performance reasons. For example, http://www.php.net/manual/en/function.preg-split.php states: "If you don't need the power of regular expressions, you can choose faster (albeit simpler) alternatives like explode() or str_split()."
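One way to check this kind of advice on your own data is a quick micro-benchmark. The sketch below uses Python's `str.find` and `re` rather than the PHP functions quoted above; the helper names `compare` and `find_with_re` and the test strings are assumptions made for illustration.

```python
import re
import timeit


def find_with_re(haystack, needle):
    """Literal substring search through the regex engine; -1 if absent."""
    match = re.search(re.escape(needle), haystack)
    return match.start() if match else -1


def compare(haystack, needle, repeats=200):
    """Return (seconds for str.find, seconds for a precompiled regex search)."""
    pattern = re.compile(re.escape(needle))
    t_find = timeit.timeit(lambda: haystack.find(needle), number=repeats)
    t_re = timeit.timeit(lambda: pattern.search(haystack), number=repeats)
    return t_find, t_re


if __name__ == "__main__":
    haystack = "lorem ipsum dolor sit amet " * 4000 + "needle"
    t_find, t_re = compare(haystack, "needle")
    print(f"str.find: {t_find:.4f} s, re.search: {t_re:.4f} s")
```

Results vary with the engine, the pattern, and the input, so treat the numbers as a measurement of your particular case rather than a general ranking.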

This trade-off exists everywhere: the more flexible and feature-rich a solution is, the poorer its performance tends to be.
