Adding a single character to my .NET RegEx causes it to hang
Here is the input data:
*** INVOICE ***
THE BIKE SHOP
1 NEW ROAD, TOWNVILLE,
SOMEWHERE, UK, AB1 2CD
TEL 01234-567890
To: COUNTER SALE No: 243529 Page: 1
Date: 04/06/10 12:00
Ref: Aiden
Cust No: 010000
Here is a regex that works (Options: singleline, ignorewhitespace, compiled) - it matches immediately and the groups are properly populated:
\W+INVOICE\W+
(?<shopAddr>.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust
As soon as I add the 'N' out of Cust No into the rex, parsing the input hangs forever:
\W+INVOICE\W+
(?<shopAddr&g开发者_高级运维t;.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust N
If I add something like "any character" :
\W+INVOICE\W+
(?<shopAddr>.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust .
It works, but as soon as I add a fixed character, the rex hangs again:
\W+INVOICE\W+
(?<shopAddr>.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust ..:
Can anyone advise why adding something so trivial would cause it to fall over? Can I enable some kind of tracing to watch the matching activity to see if it is getting stuck in a catastrophic backtrack?
With RegexOptions.IgnorePatternWhitespace
, you're telling the engine to ignore whitespaces in your pattern. Thus, when you write Cust No
in the pattern, it really means CustNo
, which doesn't match the input. This is the cause of the problem.
From the documentation:
By default, white space in a regular expression pattern is significant; it forces the regular expression engine to match a white-space character in the input string. [...]
The
RegexOptions.IgnorePatternWhitespace
option, or thex
inline option, changes this default behavior as follows:
- Unescaped white space in the regular expression pattern is ignored. To be part of a regular expression pattern, white-space characters must be escaped (e.g. as
\s
or"\ "
).
So instead of Cust No
, in IgnorePatternWhitespace
mode, you must write Cust\ No
, because otherwise it's interpreted as CustNo
.
polygenelubricants already explained why your regex failed. The reason it hangs is that you're running into catastrophic backtracking. Your regex has many parts that can match the same text in a lot of different ways. If the overall match fails, the regex engine will try all possible permutations until it either exhausts them all or aborts with a Stack Overflow.
E. g. in To:\W+(?<custAddr>.*?)\W+
the .*?
will gladly match the same characters as \W
, and since you're using Singleline
, the .*?
will also cross over into the No:...
part of the input text and further and further. In your example, I tested in RegexBuddy what happens if you add the "N" after "Cust" - the regex engine aborts after 1,000,000 steps.
To avoid this, you need to make the regex more specific, or (this might be the better option in this case) keep the regex engine from backtracking by enclosing parts that have already matched in "atomic groups":
(?>\W+INVOICE\W+)
(?>(?<shopAddr>.*?)\W+To:)
(?>\W+(?<custAddr>.*?)\W+No:)
(?>\W+(?<invNo>\d+).*?Date:)
(?>\W+(?<invDate>[0-9/\ :]+)\W+Ref:)
(?>\W+(?<ref>[\w\ ]*?)\W+Cust)
This allows the regex to fail much faster if the input and the regex happen not to fit together.
Tim Pietzcker is really onto something here when trying to avoid catastrophic backtracing. .NET has a missing feature called "possessive quantifiers". It basically means the regular expression will be as greedy as possible and not give anything up when backtracing.
For example, if you were to match the expression [abc]+c on "abc", it will succeed. The [abc]+ will first match all three characters, then the final c will fail because it has reached the end of the line. That will cause a backtrack and match just "ab", which leaves the c for a successful match.
Where if you try and match expression [abc]++c on "abc", it will fail. The [abc]++ will first match all three characters, then the final c will fail because it has reached the end of the line. However, this time there will not be a backtrack because of the posessive quantifier (the extra plus sign +), and the expression will fail to match.
Tim Pietzcker has pointed out an alternative to using a posessive quantifier. An atomic group can keep the regular expression from catastrophic backtracing. So for all practical purposes, the possessive expression [abc]++c and the atomic expression (?>[abc]+)c are equivalent.
You have saved me a lot of time. Thank you.
精彩评论