Improve performance of a regexp

2023-01-06 11:15 问答作者：

My software allows users to use regexp to prepare files. I am in the process of adding a default regexp library with common expressions that can be re-used t开发者_开发知识库o prepare a variety of formats. One common task is to remove crlf in specific parts of the files, but not in others. For instance, this:

    <TU>Lorem 
    Ipsum</TU>
    <SOURCE>This is a sentence
    that should not contain
    any line break.
    </SOURCE>

Should become:

    <TU>Lorem 
    Ipsum</TU>
    <SOURCE>This is a sentence that should not contain any line break.
    </SOURCE>

I have a rexep that does the job pretty nicely:

(?(?<=<SOURCE>(?:(?!</?SOURCE>).)*)(\r\n))

The problem is that it is processing intensive and with files above 500kb, it can take 30+ seconds. (regex is compiled, in this case, uncompiled is much slower)

It's not a big issue, but I wonder is there is a better way to achieve the same results with Regex.

Thanks in advance for your suggestions.

Try this:

\r\n(?=(?>[^<>]*(?><(?!/?SOURCE>)[^<>]*)*)</SOURCE>)

It starts out by matching \r\n, then uses a lookahead to see if the match is between <SOURCE> and </SOURCE>. It does that by looking for a </SOURCE>, but if it finds <SOURCE> first it fails. Atomic groups prevent it from saving the state information that would be needed for backtracking, because pass or fail, backtracking is never necessary.

I would prepend a negative lookahead assertion to the regex to make sure that you can in fact match a \r\n in the current position. Otherwise, the engine has to do the entire lookbehind (arbitrary-size to boot) on every single character in the entire file, only to find out then that there is no carriage return to be replaced.

(?=\r\n)(?(?<=<SOURCE>(?:(?!</?SOURCE>).)*)(\r\n))

should be a lot faster. At least in RegexBuddy, the regex engine needs a lot fewer steps to complete the match. If it doesn't in .NET, I don't know why. Perhaps the conditional regex is not as efficient (I must admit that at first I didn't recognize it and thought that there was a syntax error in your regex). I think you don't need a conditional regex in this scenario. How about

\r\n(?<=<SOURCE>(?:(?!</?SOURCE>).)*)

Is this faster? I'm assuming you're using RegexOptions.Singleline to compile the regex.

If not, well, then there probably are very many carriage returns and many other characters inside your <SOURCE> blocks, and the arbitrary-size lookbehind simply takes a long time. Then my other suggestion is to split up the task:

Match the <SOURCE> block
Replace all CRLFs inside the block (no regex required)
Replace the <SOURCE> block

"Compiling" a regular expression simply means converting it from a Deterministic Finite Automaton to a Non-deterministic Finite Automaton. It isn't "compiling into machine code" as you might expect.

NFAs are typically smaller than their corresponding DFA and can execute more space efficiently. Every DFA has at least one equivalent NFA and vice-versa. However, Perl Compatible Regular Expressions are aren't actually Regular. So I don't know that they are converted into NFAs or if "compiling" just another form of lexical analysis that done once need not be done again.

PCREs are slow according to Russ Cox, partly because of their non-regularity, and your expression above is quite non-regular.

Oh, and if you are trying to recognize nested tags with regexps, don't. Use a real (X|HT)?ML parser. You really don't want the pony to come

继续阅读：.net optimization regex

Improve performance of a regexp

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？