开发者

How can I speed up my Perl regex matching?

I want to capture several text using the following regex:

$text_normal = qr{^(\/F\d+) FF (.*?) SCF SF (.*?) MV (\(.*?)SH$};

A sample of the string is like below:

my $text = '/F12345 FF FF this is SCF SF really MV (impo开发者_开发问答rtant stuff SH';

Can that be rewritten to speed up the matching?


There's no single answer to optimizing a regex. You can watch what a particular regex is doing with the re pragma:

 use re 'debugcolor';

Once you see what it traverses the string, you see where it is having problems and adjust your regex from there. You'll learn a little about the regex engine as you do that.

You should also check out Mastering Regular Expressions, which tells you how regular expressions work and why some patterns are slower than others.


Without seeing some sample data it is hard to say.

Generally, it is a good idea to avoid using .*. Look for any possible sources of unneeded backtracking, and eliminate them.

You might be able to get away with a split with a slice if your needs are simple.

 my @vals = (split / /, $string)[0,2,5,7];


This very much depends on the profile of the data you are scanning.

You identify the piece of your regular expression which filters out the most input and do a separate simpler regular expression for that expression.

For instance, if only 5% of your input date contained the 'MV' string, you could filter for this first and only apply the full more complex regular expression if the simpler one is true.

So you would have:

if ( $text_normal =~ / MV / ) {
  $text_normal = qr{^(\/F\d+) FF (.*?) SCF SF (.*?) MV (\(.*?)SH$};
  if .......
  }
}


(.*) means that you are dealing with any number of repetitions of " SCF SF " before you find the one that indicates it's the next capture. By making it non-greedy, you're still handling the capability that even 'SCF SF' would appear in the capture after 'FF'. I think you are handling a lot of cases you don't need.

The best way to optimize a regular expression sometimes makes it more cryptic--but you definitely find ways to make the expression fail earlier. (.*?) while not being "greedy" is definitely too tolerant.

Below is a more verbose, but faster failing alternative to your second capture.

((?:[^S]|S[^C]|SC[^F]|SCF[^ ]|SCF [^S]|SCF S[^F])*)

But you can optimize it even more if you think that the string \bSCF\b should automatically make the capture commit and expect only "\bSCF SF\b". Thus you could re-write that as:

((?:[^S]|S[^C]|SC[^F]SCF\B)*) SCF SF

But you can optimize these strings even more by backtracking control. If you think that there is no way in the world that SCF would ever occur as a word and not be followed by SF on valid input. To do that, you add another group around it, with brackets (?> and ).

(?>((?:[^S]|S[^C]|SC[^F]SCF\B)*)) SCF SF

That means that the matching logic will in no way try to reassess what it captured. If the characters after this fail to be " SCF SF " the whole expression fails. And it fails long before it ever gets to try to accommodate "MV" and other sub-expressions.

In fact, given certain expressions about the uniqueness of the delimiters, the fastest performance for this expression would be:

$text_normal = qr{^(\/F\d+) FF (?>((?:[^S]|S[^C]|SC[^F]SCF\B)*))SCF SF (?>((?:[^M]|M[^V]|MV\B)*))MV (?>(\((?:[^S]|S[^H]|SH.)*))SH$};

Additionally, the verbose, exhaustive negative matches can be alternative expressed by negative lookaheads--but I have no idea on how that works on performance. But negative lookaheads would work like this:

((?:.(?! SCF))*) SCF SF

This means that for this capture I want any character that is not a space starting the string " SCF SF ".


Backtracking is one of the surest ways to kill regex performance, but, unfortunately, this does not appear to be a case where you're able to completely eliminate the . wildcard in favor of character classes, unless the text you're capturing is prohibited from containing uppercase characters. (If that prohibition does exist, you could replace your .*? with, say, [a-z ]* instead.)

You can still reduce the possibility for backtracking by using {} to set a minimum/maximum number of characters to match, such as .{0,10}? if the match can't be longer than 10 characters.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜