Java regex logical OR
I am trying to match any or all from a set of phrases in a given string. Here is my regex:
(^|\\W)(" + phrase1 + "|" + phrase2 + "|" + phrase3 + ... ")(\\W|$)
I need to be able to match any number of the phrases I am ORing. It seems to work okay except when two phrases occur immediately next to each other. So "phrase1 lorem ipsum phrase2 lorem ipsum" matches both phrase1 and phrase2, but "phrase1 phrase2 lorem ipsum" matches only phrase1 (so does "phrase1.phrase2 lorem ipsum"). If there is more than one non-word character (e.g., two or more spaces) between phrase1 and phrase2, then it matches both as well. What am I doing wrong?
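It's because you have \\W on both sides of your regexp. The non-word character that follows the first phrase is consumed by the first match, so the second match, which also requires a non-word character (or the start of the string) in front of its phrase, has nothing left to match against.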
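To make the failure concrete, here is a minimal sketch that runs a pattern of the same shape as the one in the question against the three inputs described there. The class and method names are made up for illustration, and the phrases are hard-coded as phrase1 and phrase2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original question or answer.
class ConsumingDelimiterDemo {
    // Same shape as the pattern in the question: a consuming delimiter on each side.
    private static final Pattern P =
            Pattern.compile("(^|\\W)(phrase1|phrase2)(\\W|$)");

    // Returns group 2 (the phrase itself) for every match found in the input.
    static List<String> findWithConsumingDelimiters(String input) {
        Matcher m = P.matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // Phrases separated by other words: both are found.
        System.out.println(findWithConsumingDelimiters("phrase1 lorem ipsum phrase2 lorem ipsum"));
        // A single space between the phrases: the space is consumed by the
        // first match, so only phrase1 is found.
        System.out.println(findWithConsumingDelimiters("phrase1 phrase2 lorem ipsum"));
        // Two spaces: one for each match, so both are found again.
        System.out.println(findWithConsumingDelimiters("phrase1  phrase2 lorem ipsum"));
    }
}
```

The second call is the interesting one: after the first match ends just past the space, the next search starts at the "p" of phrase2, where neither ^ nor \W can match.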
What I suspect you're after is this:
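import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

List<String> findPhrases(String s, String... phrases) {
    return findPhrases(s, Arrays.asList(phrases));
}

List<String> findPhrases(String s, Collection<String> phrases) {
    if (phrases.isEmpty()) {
        throw new IllegalArgumentException("must specify at least one phrase");
    }
    // Build the alternation phrase1|phrase2|..., quoting each phrase so that
    // any regex special characters in it are matched literally.
    StringBuilder sb = new StringBuilder();
    Iterator<String> iter = phrases.iterator();
    sb.append(Pattern.quote(iter.next()));
    while (iter.hasNext()) {
        sb.append("|");
        sb.append(Pattern.quote(iter.next()));
    }
    // \b is zero-width, so adjacent phrases do not compete for a delimiter.
    Pattern p = Pattern.compile("\\b(" + sb + ")\\b");
    Matcher m = p.matcher(s);
    List<String> ret = new ArrayList<String>();
    while (m.find()) {
        ret.add(m.group(1));
    }
    return ret;
}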
One important difference here is that I've used \b rather than \W to delimit words. \b is a zero-width match at a word boundary: the transition from a word character to a non-word character or vice versa, which includes the start or end of the string when a word character sits next to it.
Zero-width means it doesn't consume a character from the input like \W does.
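As a small illustration of the zero-width behaviour, the sketch below runs a \b-delimited pattern over adjacent phrases and returns the entire match (group 0) each time. The class and method names are hypothetical and the phrases are hard-coded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class illustrating zero-width word boundaries.
class WordBoundaryDemo {
    private static final Pattern P = Pattern.compile("\\b(phrase1|phrase2)\\b");

    // Returns the whole matched text for every match. Because \b consumes
    // nothing, the whole match is just the phrase, with no delimiter attached.
    static List<String> findWithWordBoundaries(String input) {
        Matcher m = P.matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Adjacent phrases separated by one space or one dot are both found.
        System.out.println(findWithWordBoundaries("phrase1 phrase2 lorem ipsum"));
        System.out.println(findWithWordBoundaries("phrase1.phrase2 lorem ipsum"));
        // No boundary between the two phrases here, so nothing matches.
        System.out.println(findWithWordBoundaries("phrase1phrase2"));
    }
}
```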
Edit: you seem to have two problems:
1. \W is consuming characters from your input; and
2. You have regex special characters in your phrases.
(1) can be handled several ways. My approach above is to use \b instead as it is zero-width and is a much better solution. You can also use other zero-width assertions like lookaheads and lookbehinds:
(?<=\W|^)...(?=\W|$)
but that's basically equivalent to:
\b...\b
which is far easier to read.
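For comparison, here is a rough sketch that runs both forms through the same helper. The class name, the helper `find`, and the sample inputs are assumptions made for illustration. For phrases made of word characters the two forms agree. They can differ when a phrase starts or ends with a non-word character, which is presumably why they are only "basically" equivalent; the last two calls show one such case using the literal C++.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing lookarounds with \b.
class LookaroundDemo {
    // Returns group 1 for every match of the given regex in the input.
    static List<String> find(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String lookaround = "(?<=\\W|^)(phrase1|phrase2)(?=\\W|$)";
        String boundary = "\\b(phrase1|phrase2)\\b";
        String input = "phrase1 phrase2 lorem ipsum";

        // Both forms are zero-width, so both find the two adjacent phrases.
        System.out.println(find(lookaround, input));
        System.out.println(find(boundary, input));

        // A phrase ending in a non-word character: there is no word boundary
        // between '+' and ' ', so the \b form finds nothing here, while the
        // lookaround form still finds the phrase.
        System.out.println(find("(?<=\\W|^)(C\\+\\+)(?=\\W|$)", "I like C++ a lot"));
        System.out.println(find("\\b(C\\+\\+)\\b", "I like C++ a lot"));
    }
}
```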
(2) can be handled by quoting phrases. I've amended the above code to call Pattern.quote() to quote any regex special characters.
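To show what quoting changes, the sketch below builds the alternation with and without Pattern.quote() and searches the same input. The class name, the `quote` flag, and the sample phrase "a.b" are made up for this example; the stream-based joining is just one way to build the alternation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class showing the effect of Pattern.quote().
class QuoteDemo {
    // Finds the given phrases in the input. When quote is true, each phrase
    // is passed through Pattern.quote() so it is matched literally.
    static List<String> findPhrases(String input, boolean quote, String... phrases) {
        String alternation = Arrays.stream(phrases)
                .map(p -> quote ? Pattern.quote(p) : p)
                .collect(Collectors.joining("|"));
        Matcher m = Pattern.compile("\\b(" + alternation + ")\\b").matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Unquoted, the '.' in "a.b" matches any character, so "aXb" is
        // reported as well as the literal "a.b".
        System.out.println(findPhrases("aXb a.b", false, "a.b"));
        // Quoted, only the literal text "a.b" is found.
        System.out.println(findPhrases("aXb a.b", true, "a.b"));
    }
}
```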