Exponential Regex Problem
Can someone help me rewrite this regex to be non-exponential?
I'm using perl to parse email data. I want to extract email addresses from the data. Here is a shortened version of the regex that I've been using:
my $email_address = qr/(?:[^\s@<>,":;\[\]\(\)\\]+?|"[^\"]+?")@/i
For simplicity I've removed the later domain part of the regex. (It isn't causing any problems.)
This will f开发者_开发百科ind an RFC compliant email address that either contains non-email meta chars OR a "quoted" string followed by @. Using the OR '|' part of the regex with the two different multicharacter patterns creates an exponential problem.
The problem is, when I unleash this on a line of data that is several thousands of characters long.
$ wc line7.txt
1 221 497819 line7.txt
(I'm sorry but I cannot provide input data at this time, I may be able to mock some up later.)
Much like rewriting (a*b*)* to (a|b)*, I need to rewrite this regex.
Splitting it into two separate regex's creates more work in code changes then I am willing to perform at this point. Although it would solve my problem.
The eventual target machine is on a Hadoop cluster. So I would like to avoid CPAN modules that don't come with Hadoop's version of perl. (I'll have to check if Email::Find can even be used.) This is a problem I encountered at work.
Have you considered the CPAN modules Email::Valid and Email::Find?
Unless this is for your own fun or education, you almost certainly shouldn't be trying to write your own email address matching regex. See Mastering Regular Expressions by Jeffrey Friedl if you want to know what such a thing actually looks like. (Hint: it's 6,598 bytes long.)
qr/(?:(?>[^\s@<>,":;\[\]\(\)\\])+|"[^\"]{0,62}")@/i
The (?>expression)
part prevents backtracking. It should be safe because there can be no overlap between the non-quoted part and the quoted part.
I removed the lazy repeats +?
because the parts of the alternation already look for the @
and "
respectively. Phrases could be a large source of backtracking, so I looked at the Wikipedia article which states that the local part (before the @) can be only 64 characters long (subtracting two quotes yields {0,62}
(if ""@
is not valid, then change it to {1,62}
.... I do not intend for this to be a completely functional email parser. That is your job. I simply provide help for the catastrophic backtracking.) Best of luck!
Non-greedy matches are expensive as I understand it, if you are not careful. It may do lots and lots of backtracking. http://blog.stevenlevithan.com/archives/greedy-lazy-performance
One trick I often use is to destructively pull bits of the data out once I figure out it cannot hold any data. Another trick is to do a non-backtrack match (\@{1}+ or the like) if there is something which might signal to you that there is absolutely an email address which you need to parse around there.
In your specific example, perhaps you can limit the number of characters that can be in an email address? Instead of + on the left-hand-side of the @, use {1,80}
Just changing the +?
to +
should do it; the ?
says to prefer matching as few times as possible, which is not at all what you want.
Either I'm mis-seeing something, or your problem is in the part of the regex you aren't showing us. Or there's some difference between what you are showing and what you are actually trying. In any case, you may try changing the +?
to ++
or enclosing the whole (?:...)@
in (?> ... )
.
Is there a +
before the @
in your actual regex? If so, just changing the (?:
to (?>
and making that +
be ++
would be a very good idea.
If many lines do not contain an E-mail address, how about a quick pre-test before applying the RE:
if ( my $ix = index( $line, '@' ) > 0 ) { #test E-mail address here . . . #and another wild idea you could try to cut down lengths of strings actually parsed: my $maxLength = 100; #maximum supported E-mail address length (up to the @) if ( substr( $line, MAX( $ix - $maxLength, 0), $maxLength ) =~ /YourRE/ ) }
(yes, > any line starting with a @ can not be an E-mail address)
精彩评论