开发者

Java Regular Expressions using Pattern and Matcher

My question is related to Regular Expressions in Java, and in particular, multiple matches for a given search pattern. All of the info i need to get is on 1 line and it contains an alias (e.g. SA) which maps to an IP address. Each one is separated by a comma. I need to extract each one.

SA "239.255.252.1", SB "239.255.252.2", SC "239.255.252.3", SD "239.255.252.4"

My Reg Ex looks like this:

开发者_开发知识库Pattern alias = Pattern.compile("(\\S+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"");  
Matcher match = alias.matcher(lineInFile)  
while(match.find()) {  
   // do something  
}

This works but I'm not totally happy with it because since introducing this small piece of code, my program has slowed down a bit (< 1 sec) but enough to notice a difference.

So my question is, am I going about this in the correct manner? Is there a more efficient or possibly lightweight solution without the need for a while(match) loop? and/or Pattern/Matcher classes?


If the line may not contain anything except that alias definition, then using .match() instead of .find() might speed up the searching on non-matches.


You can improve your regex to: "(\\S{2})\\s+\"((\\d{1,3}\\.){3}\\d{1,3})\"" by specifying an IP address more explicitly.

Try out the performance of using a StringTokenizer. It does not use regular expressions. (If you are concerned about using a legacy class, then take a look at its source and see how it is done.)

StringTokenizer st = new StringTokenizer(lineInFile, " ,\"");
while(st.hasMoreTokens()){
    String key = st.nextToken();
    String ip = st.nextToken();
    System.out.println(key + " ip: " +  ip);
}


I don't know if this will yield a big performance benefit, but you could also first do

string.split(", ") // separate groups

and then

string.split(" ?\"") // separate alias from IP address

on the matches.


Precompiling and reusing the Pattern object is (IMO) likely to be the most effective optimization. Pattern compilation is potentially an expensive step.

Reusing the Matcher instance (e.g. using reset(CharSequence)) might help, but I doubt that it will make much difference.

The regex itself cannot be optimized significantly. One possible speedup would be to replace (\d+\.\d+\.\d+\.\d+) with ([0-9\.]+). This might help because it reduces the number of potential backtrack points ... but you'd need to do some experiments to be sure. And the obvious downside is that it matches character sequences that are not valid IP addresses.


If you`re noticing a difference of < 1 sec on that piece of code, then your input string must contain around a million (ot at least some 100k) of entries. I think that's a pretty fair performance and I cannot see how you could significantly optimize that without writing your own specialized parser.


I'm afraid your code looks pretty efficient already. Here's my version:

Matcher match = Pattern
                .compile("(\\w+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"")
                .matcher(lineInFile);  
while(match.find()) {  
    //do something  
}

There are two micro-optimizations:

  1. No need to keep pattern in an extra variable, inlined that
  2. For the alias, search for word characters, not non-space characters

Actually, if you do a lot of processing like this and the pattern never changes, you should keep the compiled pattern in a constant:

private static final Pattern PATTERN = Pattern
            .compile("(\\w+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"");

Matcher match = PATTERN.matcher(lineInFile);  
while(match.find()) {  
    //do something  
}

Update: I took some time on RegExr to come up with a much more specific pattern, which should only detect valid IP addresses as a bonus. I know it's ugly as hell, but my guess is that it's pretty efficient, as it eliminates most of the backtracking:

([A-Z]+)\s*\"((?:1[0-9]{2}|2(?:(?:5[0-5]|[0-9]{2})|[0-9]{1,2})\.)
{3}(?:1[0-9]{2}|2(?:5[0-5]|[0-9]{2})|[0-9]{1,2}))

(Wrapped for readability, all back-slashes need to be escaped in java, but you can test it on RegExr as it is with the OP's test string)

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜