开发者

Regex MatchCollection obj hangs\"Function evuluation timed out" after Regex.Matches

I'm kind of new too C#, and regular expression for that matter, but I've searched a couple of hours to find a solution too this problem so, hopefully this is easy for you guys:)

My application uses a regex to match email addresses in a given string, then loops throu the matches.:

String EmailPattern = "\\w+([-+.]\\w+)*@\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*";
MatchCollection mcemail = Regex.Matches(rawHTML, EmailPattern);

foreach (Match memail in mcemail)

Works fine, but, when I downloaded the string from a certain page, http://www.sp.se/sv/index/services/quality/sidor/default.aspx, the MatchCollection(mcemail) object "hangs" the loop. When using a break poi开发者_JAVA技巧nt and accessing the object, I get "Function evuluation timed out" on everything(.Count etc).

Update I've tried my pattern and other email patterns on the same string, everyone(regex desingers, python based web pages etc.) fails/timesout when trying too match this particular string.

How can I detect that the matchcollection obj is not "ready" to use?


If you can post the email that's causing the problem (perhaps anonymized in some way), that will give us more information, but I'm thinking the problem is this little guy right here:

([-.]\\w+)*\\.\\w+([-.]\\w+)*

To understand the problem, let's break that into groups:

([-.]\\w+)*
\\.\\w+
([-.]\\w+)*

The strings that will match \\.\\w+ are a subset of those that will match [-.]\\w+. So if part of your input looks like foo.bar.baz.blah.yadda.com, your regex engine has no way of knowing which group is supposed to match it. Does that make sense? So the first ([-.]\\w+)* could match .bar.baz.blah, then the \\.\\w+ could match .yadda, then the last ([-.]\\w+)* could match .com...

...OR the first clause could match .bar.baz, the second could match .blah, and the last could match .yadda.com. Since it doesn't know which one is right, it will keep trying different combinations. It should stop eventually, but that could still take a long time. This is called "catastrophic backtracking".

This issue is compounded by the fact that you're using capturing groups rather than non-capturing groups; i.e. ([-+.]\\w+) instead of (?:[-+.]\\w+). That causes the engine to try and separate and save whatever matches inside the parentheses for later reference. But as I explained above, it's ambiguous which group each substring belongs in.

You might consider replacing everything after the @ with something like this:

\\w[-\\w]*\\.[-.\\w]+

That could use some refinement to make it more specific, but you get the general idea. Hope I explained all this well enough; grouping and backreferences are kind of tough to describe.

EDIT:

Looking back at your pattern, there's a deeper issue here, still related to the backtracking/ambiguity problem I mentioned. The clause \\w+([-.]\\w+)* is ambiguous all by itself. Splitting it into parts, we have:

\\w+
([-.]\\w+)*

Suppose you have a string like foobar. Where does the \\w+ end and the ([-.]\\w+)* begin? How many repetitions of ([-.]\\w+) are there? Any of the following could work as matches:

f(oobar)
foo(bar)
f(o)(oba)(r)
f(o)(o)(b)(a)(r)
foobar
etc...

The regex engine doesn't know which is important, so it will try them all. This is the same problem I pointed out above, but it means you have it in multiple places in your pattern.

Even worse, ([-.]\\w+)* is also ambiguous, because of the + after the \\w. How many groups are there in blah? I count 16 possible combinations: (blah), (b)(lah), (bl)(ah)...

The amount of different possible combinations is going to be huge, even for a relatively small input, so your engine is going to be in overdrive. I would definitely simplify it if I were you.


I just did a local test and it appears either the sheer document size or something in the ViewState causes the Regex match evaluation to time out. (Edit: I'm pretty sure it's the size, actually. Removing the ViewState just reduces the size significantly.)

An admittedly crude way to solve this would be something like this:

string[] rawHtmlLines = File.ReadAllLines(@"C:\default.aspx");
string filteredHtml = String.Join(Environment.NewLine, 
    rawHtmlLines.Where(line => !line.Contains("_VIEWSTATE")).ToArray());
string emailPattern = @"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";
var emailMatches = Regex.Matches(filteredHtml, emailPattern);

foreach (Match match in emailMatches)
{
    //...
}

Overall I suspect the email pattern is just not well optimised (or intended) to filter out emails in a large string but just used as validation for user input. Generally it might be a good idea to limit the string you search in to just the parts you are actually interested in and keep it as small as possible - for example by leaving out the ViewState which is guaranteed to not contain any readable email addresses.

If performance is important, it's probably also a better idea to create the filtered HTML using a StringBuilder and IndexOf (etc.) instead of splitting lines and LINQing up the result :)

Edit:

To further minimize the length of the string the Regex needs to check you could only include lines that contain the @ character to begin with, like so:

string filteredHtml = String.Join(Environment.NewLine, 
    rawHtmlLines.Where(line => line.IndexOf('@') >= 0 && !line.Contains("_VIEWSTATE")).ToArray());


From "Function evaluation timed out", I'm assuming you're doing this in the debugger. The debugger has some fairly quick timeouts with regard to how long a method takes. Not eveything happens quickly. I would suggest going the operation in code, storing the result, then viewing that result in the debugger (i.e. let the call to Matches run and put a breakpoint after it).

Now, with regard to detecting whether the string will make Matches take a long time; that's a bit of a black art. You basically have to perform some sort of input validation. Just because you got some value from the internet, doesn't mean that value will work well with Matches. The ultimate validation logic is up to you; but, starting with the length of rawHtmlLines might be useful. (i.e. if the lenght is 1000000 bytes, Matches might take a while) But, you have to decide what to do if the length is too long; e.g give an error to the user.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜