How do I escape '+' in pattern matching to highlight keyword?
I'm implementing a keyword highlighter in Java. I'm using java.util.regex.Pattern to highlight (make bold) a keyword within a String content. The following piece of code works fine for alphanumeric keywords, but not for keywords containing some special characters. For example, I would like to highlight the keyword c++, which contains the special character + (plus), but it is not highlighted properly. How do I escape the + character so that c++ is highlighted?
public static void main(String[] args)
{
    String content = "java,c++,ejb,struts,j2ee,hibernate";
    System.out.println("CONTENT: " + content);
    String highlight = "C++";
    System.out.println("HIGHLIGHT KEYWORD: " + highlight);
    //highlight = highlight.replaceAll(Pattern.quote("+"), "\\\\+");
    java.util.regex.Pattern pattern = java.util.regex.Pattern.compile("\\b" + highlight + "\\b", java.util.regex.Pattern.CASE_INSENSITIVE);
    System.out.println("PATTERN: " + pattern.pattern());
    java.util.regex.Matcher matcher = pattern.matcher(content);
    while (matcher.find()) {
        System.out.println("Match found!!!");
        for (int i = 0; i <= matcher.groupCount(); i++) {
            System.out.println(matcher.group(i));
            content = matcher.replaceAll("<B>" + matcher.group(i) + "</B>");
        }
    }
    System.out.println("RESULT: " + content);
}
Output:
CONTENT: java,c++,ejb,struts,j2ee,hibernate
HIGHLIGHT KEYWORD: C++
PATTERN: \bC++\b
Match found!!!
c
RESULT: java,c++,ejb,struts,j2ee,hibernate

I even tried to escape '+' before calling Pattern.compile, like this:
highlight = highlight.replaceAll(Pattern.quote("+"), "\\\\+");
but I still can't get the syntax right. Can somebody help me solve this?
This should do what you need:
Pattern pattern = Pattern.compile(
    "\\b"
    + Pattern.quote(highlight)
    + "\\b",
    Pattern.CASE_INSENSITIVE);
Update: you are right, the above doesn't work for C++ (\b matches word boundaries, and + is not a word character, so there is no boundary after ++). We need a more complicated solution:
Pattern pattern = Pattern.compile(
    "\\b"
    + Pattern.quote(highlight)
    + "(?![^\\p{Punct}\\s])", // matches if the match is not followed by
                              // anything other than whitespace or punctuation
    Pattern.CASE_INSENSITIVE);
Update in response to comments: it seems that you need more logic in your pattern creation. Here's a helper method to create the pattern for you:
private static final String WORD_BOUNDARY = "\\b";
// edit this to suit your needs:
private static final String ALLOWED = "[^,.!\\-\\s]";
private static final String LOOKAHEAD = "(?!" + ALLOWED + ")";
private static final String LOOKBEHIND = "(?<!" + ALLOWED + ")";

public static Pattern createHighlightPattern(final String highlight) {
    final Pattern pattern = Pattern.compile(
        (Character.isLetterOrDigit(highlight.charAt(0))
            ? WORD_BOUNDARY : LOOKBEHIND)
        + Pattern.quote(highlight)
        + (Character.isLetterOrDigit(highlight.charAt(highlight.length() - 1))
            ? WORD_BOUNDARY : LOOKAHEAD),
        Pattern.CASE_INSENSITIVE);
    return pattern;
}
And here is some test code to check that it works:
private static void testMatch(final String haystack, final String needle) {
    final Matcher matcher = createHighlightPattern(needle).matcher(haystack);
    if (!matcher.find())
        System.out.println("Failed to find pattern " + needle);
    while (matcher.find())
        System.out.println("Found additional match: " + matcher.group() +
            " for pattern " + needle);
}

public static void main(final String[] args) {
    final String testString = "java,c++,hibernate,.net,asp.net,c#,spring";
    testMatch(testString, "java");
    testMatch(testString, "c++");
    testMatch(testString, ".net");
    testMatch(testString, "c#");
}
When I run this method, I don't see any output (which is good :-))
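The test above only checks that each keyword is found exactly once. To actually produce the bold markup the question asks for, one option is to pass the pattern to Matcher.replaceAll with "$0", which stands for the whole match. The following is only a sketch: the class name KeywordHighlighter and the highlight method are made up for illustration, and the pattern logic is a compact copy of the helper above.

```java
import java.util.regex.Pattern;

public class KeywordHighlighter {
    // Same character class as the ALLOWED constant in the helper above.
    private static final String ALLOWED = "[^,.!\\-\\s]";

    // Compact copy of createHighlightPattern: use \b next to letters/digits,
    // and a negative lookaround next to any other character.
    static Pattern createHighlightPattern(String keyword) {
        boolean wordStart = Character.isLetterOrDigit(keyword.charAt(0));
        boolean wordEnd = Character.isLetterOrDigit(keyword.charAt(keyword.length() - 1));
        return Pattern.compile(
            (wordStart ? "\\b" : "(?<!" + ALLOWED + ")")
            + Pattern.quote(keyword)
            + (wordEnd ? "\\b" : "(?!" + ALLOWED + ")"),
            Pattern.CASE_INSENSITIVE);
    }

    // Hypothetical helper: wraps every match in <B>...</B>.
    // "$0" in the replacement refers to the matched text, so the original
    // casing of the content is preserved.
    static String highlight(String content, String keyword) {
        return createHighlightPattern(keyword).matcher(content).replaceAll("<B>$0</B>");
    }

    public static void main(String[] args) {
        System.out.println(highlight("java,c++,ejb,struts,j2ee,hibernate", "C++"));
        // prints: java,<B>c++</B>,ejb,struts,j2ee,hibernate
        System.out.println(highlight("java,c++,hibernate,.net,asp.net,c#,spring", ".net"));
        // prints: java,c++,hibernate,<B>.net</B>,asp.net,c#,spring
    }
}
```

Calling replaceAll once, outside any find() loop, also avoids modifying the content while a matcher is still iterating over it.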
The problem is that the \b word boundary anchor does not match, because + is a non-word character and whatever follows it (I assumed whitespace; in your sample content it is a comma) is also a non-word character.
A word boundary \b matches at a change from a word character (a member of \w) to a non-word character (not a member of \w), or the other way round.
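To make the boundary rule concrete, here is a minimal sketch. The class name WordBoundaryDemo and the helper found are made up for illustration; the content string is a shortened version of the one in the question.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Returns true if the regex matches anywhere in the input (case-insensitive).
    static boolean found(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input).find();
    }

    public static void main(String[] args) {
        String content = "java,c++,ejb";

        // '+' and ',' are both non-word characters, so there is no \b between them.
        System.out.println(found("\\bc\\+\\+\\b", content)); // false

        // Dropping the trailing \b lets the escaped keyword match.
        System.out.println(found("\\bc\\+\\+", content));    // true

        // "ejb" starts and ends with word characters, so \b works on both sides.
        System.out.println(found("\\bejb\\b", content));     // true
    }
}
```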
Also, if you want to match a + literally, you have to escape it. As written, your pattern C++ means "match one or more C": the first + is the quantifier and the second + makes it possessive, so it matches at least one C and does not backtrack.
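A small sketch of the difference between the unescaped and the quoted keyword (the class name PlusQuantifierDemo and the helper matchesWhole are made up for illustration):

```java
import java.util.regex.Pattern;

public class PlusQuantifierDemo {
    // Returns true if the regex matches the entire input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Unescaped: "C++" is a possessive "one or more C".
        System.out.println(matchesWhole("C++", "CCC")); // true
        System.out.println(matchesWhole("C++", "C++")); // false

        // Quoted: Pattern.quote turns the keyword into a literal.
        String literal = Pattern.quote("C++");
        System.out.println(matchesWhole(literal, "C++")); // true
        System.out.println(matchesWhole(literal, "CCC")); // false
    }
}
```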
Try changing your pattern to something like this:
java.util.regex.Pattern.compile("\\b" + highlight + "(?=\\s)", java.util.regex.Pattern.CASE_INSENSITIVE);
(?=\s) is a positive lookahead that checks whether whitespace follows your highlight. Inside a Java string literal the backslash must be doubled, hence "\\s".
Additionally, you will need to escape the + you are searching for.
All you need is this:
Pattern.compile("\\Q"+highlight+"\\E", java.util.regex.Pattern.CASE_INSENSITIVE);
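Wrapping the keyword in \Q...\E by hand has the same effect as Pattern.quote(highlight) for ordinary keywords; Pattern.quote additionally copes with a keyword that itself contains \E. One caveat, sketched below on the assumption that keywords should only match as whole items: quoting alone adds no boundary check, so a keyword is also found inside longer words. The class name QuoteOnlyDemo and the helper found are made up for illustration.

```java
import java.util.regex.Pattern;

public class QuoteOnlyDemo {
    // Builds the quote-only pattern from the line above and searches the content.
    static boolean found(String keyword, String content) {
        return Pattern.compile("\\Q" + keyword + "\\E", Pattern.CASE_INSENSITIVE)
                      .matcher(content)
                      .find();
    }

    public static void main(String[] args) {
        // The quoted keyword now matches c++ literally.
        System.out.println(found("C++", "java,c++,ejb"));    // true

        // But with no boundary check, "java" is also found inside "javascript".
        System.out.println(found("java", "javascript,ejb")); // true
    }
}
```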
Assuming your keyword does not begin or end with punctuation, here is a commented regex which uses lookahead and lookbehind to achieve your desired matching behavior:
// Compile regex to match a keyword or keyphrase.
java.util.regex.Pattern pattern = java.util.regex.Pattern.compile(
    "(?<=[\\s'\".?!,;:]|^) # Word preceded by ws, quote, punct or BOS.\n" +
    // Escape any regex metacharacters in the keyword phrase.
    java.util.regex.Pattern.quote(highlight) + " # Keyword to be matched.\n" +
    "(?=[\\s'\".?!,;:]|$) # Word followed by ws, quote, punct or EOS.",
    java.util.regex.Pattern.CASE_INSENSITIVE
        | java.util.regex.Pattern.UNICODE_CASE
        | java.util.regex.Pattern.COMMENTS);
Note that this solution works even if your keyword is a phrase containing spaces.