Adding a single character to my .NET RegEx causes it to hang

2023-01-02 01:06 问答作者：

Here is the input data:

                                *** INVOICE ***                                

                              THE BIKE SHOP                              
                      1 NEW ROAD, TOWNVILLE,                       
                          SOMEWHERE, UK, AB1 2CD                          
                        TEL 01234-567890  

 To: COUNTER SALE                                   No:  243529 Page: 1

                                                    Date: 04/06/10 12:00

                                                    Ref:    Aiden   

 Cust No: 010000

Here is a regex that works (Options: singleline, ignorewhitespace, compiled) - it matches immediately and the groups are properly populated:

\W+INVOICE\W+
(?<shopAddr>.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust

As soon as I add the 'N' out of Cust No into the rex, parsing the input hangs forever:

\W+INVOICE\W+
(?<shopAddr&g开发者_高级运维t;.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust N

If I add something like "any character" :

\W+INVOICE\W+
(?<shopAddr>.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust .

It works, but as soon as I add a fixed character, the rex hangs again:

\W+INVOICE\W+
(?<shopAddr>.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust ..:

Can anyone advise why adding something so trivial would cause it to fall over? Can I enable some kind of tracing to watch the matching activity to see if it is getting stuck in a catastrophic backtrack?

With RegexOptions.IgnorePatternWhitespace, you're telling the engine to ignore whitespaces in your pattern. Thus, when you write Cust No in the pattern, it really means CustNo, which doesn't match the input. This is the cause of the problem.

From the documentation:

By default, white space in a regular expression pattern is significant; it forces the regular expression engine to match a white-space character in the input string. [...]

The RegexOptions.IgnorePatternWhitespace option, or the x inline option, changes this default behavior as follows:

Unescaped white space in the regular expression pattern is ignored. To be part of a regular expression pattern, white-space characters must be escaped (e.g. as \s or "\ ").

So instead of Cust No, in IgnorePatternWhitespace mode, you must write Cust\ No, because otherwise it's interpreted as CustNo.

polygenelubricants already explained why your regex failed. The reason it hangs is that you're running into catastrophic backtracking. Your regex has many parts that can match the same text in a lot of different ways. If the overall match fails, the regex engine will try all possible permutations until it either exhausts them all or aborts with a Stack Overflow.

E. g. in To:\W+(?<custAddr>.*?)\W+ the .*? will gladly match the same characters as \W, and since you're using Singleline, the .*? will also cross over into the No:... part of the input text and further and further. In your example, I tested in RegexBuddy what happens if you add the "N" after "Cust" - the regex engine aborts after 1,000,000 steps.

To avoid this, you need to make the regex more specific, or (this might be the better option in this case) keep the regex engine from backtracking by enclosing parts that have already matched in "atomic groups":

(?>\W+INVOICE\W+)
(?>(?<shopAddr>.*?)\W+To:)
(?>\W+(?<custAddr>.*?)\W+No:)
(?>\W+(?<invNo>\d+).*?Date:)
(?>\W+(?<invDate>[0-9/\ :]+)\W+Ref:)
(?>\W+(?<ref>[\w\ ]*?)\W+Cust)

This allows the regex to fail much faster if the input and the regex happen not to fit together.

Tim Pietzcker is really onto something here when trying to avoid catastrophic backtracing. .NET has a missing feature called "possessive quantifiers". It basically means the regular expression will be as greedy as possible and not give anything up when backtracing.

For example, if you were to match the expression [abc]+c on "abc", it will succeed. The [abc]+ will first match all three characters, then the final c will fail because it has reached the end of the line. That will cause a backtrack and match just "ab", which leaves the c for a successful match.

Where if you try and match expression [abc]++c on "abc", it will fail. The [abc]++ will first match all three characters, then the final c will fail because it has reached the end of the line. However, this time there will not be a backtrack because of the posessive quantifier (the extra plus sign +), and the expression will fail to match.

Tim Pietzcker has pointed out an alternative to using a posessive quantifier. An atomic group can keep the regular expression from catastrophic backtracing. So for all practical purposes, the possessive expression [abc]++c and the atomic expression (?>[abc]+)c are equivalent.

You have saved me a lot of time. Thank you.

继续阅读：.net freeze regex

Adding a single character to my .NET RegEx causes it to hang

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？