开发者

String manipulation vs Regexps

We are often told that Regexps are slow and should be avoided whenever possible.

However, taking into account the overhead of doing some string manipulation oneself (not talking about algorithm mistakes - this is a different matter), especially in PHP or Perl (maybe Java) what is the limit, in which case can we consider string manipulation to be a better alternative? What regexps are particularly CPU greedy?

For instance, for the following, in C++, Java, PHP or Perl, what would you recommend

The regexps would probably be faster:

  • s/abc/def/g or a ... while((i=index("abc",$x)>=0) ...$y .= substr()... based solution?
  • s/(\d)+/N/g or a scanning algorithm

But what about

  • an email validation regexp?
  • s/((0|\w)+?[xy]*[^xy]){2,7}/u/g

wouldn't a handmade and specific algorithm be faster (while longer to write)?

edit

The point of the question is to determine what kind of regexp would better be rewritten specifically for a given problem via string manipulation?

edit2

A common implementation is Perl regexp. For instance in Perl - that requires to know how they are implemented - what kin开发者_C百科d of regexp is to be avoided, because the implementation will make the process lengthy and ineffective? It may not be a complex regexp...

edit July 2011 (based on comments)

I'm not saying all regexps are slow. Some particular regexps patterns are known to be slow, due to the particular processing their and due to their implementation.

In recent Perl / PHP implementations for instance, what is known to be rather slow - and should be avoided?

The answer is expected from people who did already their own research (profiler...) and who are able to provide a kind of general guidelines about what is recommended/to be avoided.


Who said regexes were slow? At least in Perl they tend to be the preferred method of manipulating strings.

Regexes are bad at some things like email validation because the subject is too complex, not because they are slow. A proper email validation regex is over 6,000 characters long, and it doesn't even handle all of the cases (you have to strip out comments first).

At least in Perl 5, if it has a grammar it probably shouldn't be parsed with one regex.

You should also rewrite a regex as a custom function if the regex has grown to the point it can no longer be easily maintained (see the previous email validation example) or profiling shows that the regex is the slow component of your code.

You seem to be concerned with the speed of the regex vs the custom algorithm, but that is not a valid concern until you prove that it is with a profiler. Write the code in the most maintainable way. If a regex is clear, then use a regex. If a custom algorithm is clear, then use a custom algorithm. If you find that either is eating up a lot of time after profiling your code, then start looking for alternatives.


A nice feature of manipulating text with regular expressions is that patterns are high-level and declarative. This leaves the implementation considerable room for optimization such as factoring out the longest common prefix or using Boyer-Moore for static strings. Concise notation makes for quicker reading by experts. I understand immediately what

if (s/^(.)//) {
  ...
}

is doing, and index($_, 0, 1) = "" looks noisy in comparison.

Rather than the lower bound, the important consideration for regular expressions is the upper bound. It's a powerful tool, so people believe it's capable of correctly extracting tokens from XML, email addresses, or C++ programs and don't realize that an even more powerful tool such as a parser is necessary.


Regular expressions will never be faster than a hand-made algorithm for a very specific purpose. Worse, in PHP they have to be compiled the first time they're used (a cache is used afterwards).

However, they are certainly more succinct. Moreover, writing custom algorithms is often slower than regexes because the regular expressions engine are usually implemented in a more low level language, with less overhead in calling functions, etc.

For instance, preg_replace('/a/', 'b', $string) will almost certainly be faster than looping in PHP through the string. But this is a constant penalty in execution time and sometimes regular expressions, due to backtracking, can have a much worse asymptotic behavior.

You are strongly encouraged to know how regular expressions are implemented so that you can know when you're writing inefficient ones.


Some regular expressions are extremely fast and the difference between the regex and a custom solution may be negligible (or not worth anyone's time to bother).

The cases where regular expressions are slow, however, is when excessive backtracking occurs. Regular expressions parse from left to right and have the potential to match text in more than one way. So if they reach a point where the engine realizes that the pattern isn't going to match your test string, then it may start over and try to match in another way. This repeated backtracking adds up and slows down the algorithm.

Often the regular expression can be rewritten to perform better. But the ultimate in performance would be to write your own optimized parser for the specific task. By writing your own parser you can for example parse from left to right while maintaining a memory (or state). If you use this technique in procedural code you can often achieve the effect you're looking for in one pass and without the slowness of backtracking.

I was faced with this decision earlier this year. In fact the task at hand was on the outer fringe of what was even possible with regular expressions. In the end I decided to write my own parser, an embedded pushdown automaton, which is incredibly efficient for what I was trying to do. The task, by the way, was to build something that can parse regular expressions and provide Intellisense-like code hinting for them.

It's somewhat ironic that I didn't use regular expressions to parse regular expressions, but you can read about the thought behind it all here... http://blog.regexhero.net/2010/03/code-hinting-for-regular-expressions.html


what kind of regexp would better be rewritten specifically for a given problem via string manipulation?

Easy.

  1. Determine if you ever need to rewrite anything.
    (positive answer would be for about 1 per 10000 scripts, massive text parsing, resource critical)
  2. Do profile possible solutions.
  3. Use one suits you for a given problem

As for the rest 9999 cases do not waste your time with such a trifle problem and use whatever you like more.

Every time you ask yourself such a question, it is extremely useful to remind yourself that by default all your extra-optimized and super-fast code being parsed char by char on every user request. No brain-cracking regexps, no devious string manipulation, but just old good picking chars one by one.


Regexes aren't slow. But implementation can be slow, mostly because it is often interpreted and build again each time when they are used. But good regexp library allows you to use compiled versions. They are pretty fast.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜