开发者

Regex execution is too slow in java [duplicate]

This question already has an answer here: Fixing Catastrophic Backtracking in Regular Expression (1 answer) Closed 2 years ago.

My purpose is to match this kind of different urls:

url.com

my.url.com

my.extended.url.com

a.super.extended.url.com

and so on...

So, I decided to build the regex to have a letter or a number at start and end of the url, and to have a infinite number of "subdomains" with alphanumeric characters and a dot. For example, in "my.extended.url.com", "m" from "my" is the first class of the regex, "m" from "com" is the last class of the regex, and "y.", "extended." and "url." are the second class of the regex.

Using the p开发者_JAVA技巧attern and subject in the code below, I want the find method to return me a false because this url must not match, but it uses 100% of CPU and seems to stay in an infinite loop.

    
    String subject = "www.association-belgo-palestinienne-be";
    Pattern pattern = Pattern.compile("^[A-Za-z0-9]\\.?([A-Za-z0-9_-]+\\.?)*[A-Za-z0-9]\\.[A-Za-z]{2,6}");

    Matcher m = pattern.matcher(subject);
    System.out.println("    Start");
    boolean hasFind = m.find();
    System.out.println("    Finish : " + hasFind);
  

Which only prints:

  
      Start
  

I can't reproduce the problem using regex testers.

Is it normal ? Is the problem coming from my regex ?

Could it be due to my Java version (1.6.0_22-b04 / JVM 64 bit 17.1-b03) ?

Thanks in advance for helping.


The problem is the ([A-Za-z0-9_-]+\\.?)* part of the regular expression. Note that it has a quantifier (+) inside another quantifier (*). This causes catastrophic backtracking - basically, it has to try an exponential number of matches in order to check the regular expression, at least the way most regular expression engines are implemented (including the Java one).

If you use possessive quantifiers, you will be able to avoid this problem, however that would change the meaning of your regex, and it would no longer match what you want it to match.

I think the trick here is to find a regex which expresses what you want to solve, without double quantifiers. For example, the following should work:

Pattern.compile("^[A-Za-z0-9]\\.?([A-Za-z0-9_-]|[A-Za-z0-9_-]\\.)*[A-Za-z0-9]\\.[A-Za-z]{2,6}$");

I think this expresses the same class of strings that you are trying to match, and should be much faster.


It's not an infinite loop. The problem is that it's checking every possible match and not finding one. If you could let it run for a gazillion years, it will eventually terminate. See this article for a good explanation of what's happening under the hood.

Perhaps this regular expression is satisfactory (it terminates on the given string): ^[A-Za-z0-9][A-Za-z0-9_-]*(\\.[A-Za-z0-9_-]+)*\\.[A-Za-z]{2,6}$ (see http://ideone.com/Z0rlg)


It isn't really an infinite loop, it's just taking a really long time. For all practical purposes, we can call it a hang.

Your Regex may be improved.

Try to put $ at the end of it. It will say that this is the end of the line. It may help you saving time.

Edit :

 String subject = "www-association-belgo-palestinienne-be";
 Pattern pattern = Pattern.compile("^[A-Za-z0-9]([-_A-Za-z0-9]*)(\\.[-_A-Za-z0-9]+)*\\.([-_A-Za-z0-9]+\\.)*([-_A-Za-z0-9]*)[A-Za-z0-9]$");

 Matcher m = pattern.matcher(subject);
 System.out.println("    Start");
 boolean hasFind = m.find();
 System.out.println("    Finish : " + hasFind);


See How do you debug a regex?.

Specifically, I would try regexpal, and change the java backslashes to single ones.


It is an obvious Java regexp implementation bug. Look at the results with Your regexp and input data here

and You will see how quickly this is evaluated

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜