Java Regex Problem
I have a string that I am trying to extract patterns from. The string is as follows:
( ELT2N ( ELTOK wpSA910 wpSA909 wpSA908 wpSA474 ) )
The problem is that I don't know how many strings beginning with 'wp' will be in the string I am searching, but I want to extract all of them using one statement. I am currently using the pattern below:
private final static String STARS_LINE_PATTERN = "\\(\\s+?(\\w+?)\\s+?\\(\\s+(\\w+)\\s+?(\\w+?\\s??){1,}\\s+?\\)\\s+?\\)";
The pattern is matching the string and returning the 'ELT2N' and the 'ELTOK' strings but is not returning the strings prefixed by 'wp'.
Can anyone help?
Thanks
Simon
Java regex, like most flavors, keeps only the last capture when you repeat a capturing group.
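To see that behavior in isolation, here is a minimal sketch. The class name LastCaptureDemo and the method lastCapture are made up for illustration, and the pattern is a simplified stand-in, not the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question.
class LastCaptureDemo {
    // Returns whatever group 1 holds after the repeated group has matched.
    static String lastCapture(String text) {
        // Group 1 sits inside a repeated non-capturing group, so it is
        // overwritten on every iteration.
        Matcher m = Pattern.compile("(?:(\\w+)\\s*)+").matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Four tokens are matched, but only the last one survives in group 1.
        System.out.println(lastCapture("wpSA910 wpSA909 wpSA908 wpSA474"));
    }
}
```

Run against the four wp tokens, group 1 ends up holding only wpSA474, which is the same effect the question describes.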
For this particular problem, you may want to match the entire wp sequence into one group in one regex, and then post-process it again with another regex. In this case, a simple split is enough.
Here's a snippet to illustrate the idea:
import java.util.regex.*;
import java.util.*;
//...
String text = "( ELT2N ( ELTOK wpSA910 wpSA909 wpSA908 wpSA474 ) )";
String regex =
    "< (word) < (word) ((?:word )+)> >"
        .replace(" ", "\\s+")
        .replace("<", "\\(")
        .replace(">", "\\)")
        .replace("word", "\\w+");
Matcher m = Pattern.compile(regex).matcher(text);
if (m.find()) {
    System.out.printf("%s; %s;%n%s",
        m.group(1),
        m.group(2),
        Arrays.toString(m.group(3).split("\\s+"))
    );
}
The above prints:
ELT2N; ELTOK;
[wpSA910, wpSA909, wpSA908, wpSA474]
So the entire wp sequence is captured by \3 of the regex pattern, which is then split into its parts.
References
- regular-expressions.info/Repeating a Capturing Group vs Capturing a Repeating Group
Related questions
- Is there a regex flavor that allows me to count the number of repetitions matched by * and +?
- In .NET, you can query all intermediate Captures, but not so in Java
MvanGeest's comment is correct: if you use a quantifier on a capture group, only the last value is stored. Put simply, if you do not know how many 'sets' there are, the overall process cannot be done in a single step. You would first have to match all of the wp-prefixed strings into a single group, so that you have "ELT2N", "ELTOK" and "wpSA910 wpSA909 wpSA908 wpSA474", and then parse the last string independently to separate the individual values. I've not used Java in years, and never Java regex, so I can't tell you the exact steps, but using the pattern...
private final static String STARS_LINE_PATTERN = "\\(\\s+?(\\w+?)\\s+?\\(\\s+(\\w+)\\s+?((?:\\w+?\\s??){1,})\\s+?\\)\\s+?\\)";
...should split the string initially. In PHP I'd just use explode to split \3 into an array to get the individual values; I'm sure you have something similar available.
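In Java, the closest counterpart to explode is probably String.split. A rough sketch of the two steps, using the pattern from this answer unchanged; the class name ExplodeDemo and the method wpTokens are invented for the example:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern string comes from the answer above.
class ExplodeDemo {
    private static final Pattern STARS_LINE = Pattern.compile(
        "\\(\\s+?(\\w+?)\\s+?\\(\\s+(\\w+)\\s+?((?:\\w+?\\s??){1,})\\s+?\\)\\s+?\\)");

    // Step 1: capture the whole wp run in group 3.
    // Step 2: split that run on whitespace.
    static String[] wpTokens(String line) {
        Matcher m = STARS_LINE.matcher(line);
        if (!m.find()) {
            return new String[0]; // no match: return an empty array
        }
        return m.group(3).trim().split("\\s+");
    }

    public static void main(String[] args) {
        String line = "( ELT2N ( ELTOK wpSA910 wpSA909 wpSA908 wpSA474 ) )";
        System.out.println(Arrays.toString(wpTokens(line)));
    }
}
```

The trim() call is only a precaution in case the group picks up surrounding whitespace on other inputs.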
How about String#split(" wp")? Drop the first result, and you will need to fudge the last, but it will do the job.
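One way this could look in practice is sketched below. The names SplitOnWpDemo, wpTokens and untilWhitespace are made up, and the "fudge" chosen here (cutting each piece at its first whitespace and putting the "wp" prefix back) is just one possible reading of the suggestion.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class illustrating the split(" wp") idea.
class SplitOnWpDemo {
    // Returns the leading part of s up to, but not including, the first whitespace.
    static String untilWhitespace(String s) {
        int end = 0;
        while (end < s.length() && !Character.isWhitespace(s.charAt(end))) {
            end++;
        }
        return s.substring(0, end);
    }

    static List<String> wpTokens(String input) {
        String[] parts = input.split(" wp");
        List<String> result = new ArrayList<>();
        // Start at 1 to drop the text before the first " wp".
        for (int i = 1; i < parts.length; i++) {
            // The split consumed the "wp" prefix, so add it back; cutting at
            // whitespace removes the trailing " ) )" from the last piece.
            result.add("wp" + untilWhitespace(parts[i]));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wpTokens("( ELT2N ( ELTOK wpSA910 wpSA909 wpSA908 wpSA474 ) )"));
    }
}
```

Note that this relies on every wanted token being preceded by exactly one space, so it is more fragile than the regex-plus-split approach.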
It would be easier to do it without regex at all, like this:
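String input = "( ELT2N ( ELTOK wpSA910 wpSA909 wpSA908 wpSA474 ) )";
// String.split needs a delimiter argument; split on runs of whitespace
String[] tokens = input.split("\\s+");
String result = "";
for (int i = 0; i < tokens.length; i++) {
    // Keep only the tokens that start with "wp"
    if (tokens[i].startsWith("wp")) {
        result += tokens[i] + " ";
    }
}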