How do you use the Java word boundary with apostrophes?
I am trying to delete all the occurrences of a word in a list, but I am having trouble when there are apostrophes in the words.
String phrase="bob has a bike and bob's bike is red";
String word="bob";
phrase=phrase.replaceAll("\\b"+word+"\\b","");
System.out.println(phrase);
output:
has a bike and 's bike is red
What I want is
has a bike and bob's bike is red
I have a limited understanding of regex, so I'm guessing there is a solution, but I do not know enough to create a regex that handles apostrophes. I would also like it to work with dashes, so that for the phrase "the new mail is e-mail" it would only replace the first occurrence of mail.
It all depends on what you understand to be a "word". Perhaps you'd better define what you consider a word delimiter (for example, blanks, commas, ...) and write something like:
phrase=phrase.replaceAll("([ \\s,.;])" + Pattern.quote(word)+ "([ \\s,.;])","$1$2");
But you'll additionally have to check for occurrences at the start and the end of the string. For example:
String phrase="bob has a bike bob, bob and boba bob's bike is red and \"bob\" stuff.";
String word="bob";
phrase=phrase.replaceAll("([\\s,.;])" + Pattern.quote(word) + "([\\s,.;])","$1$2");
System.out.println(phrase);
prints this
bob has a bike , and boba bob's bike is red and "bob" stuff.
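One way to avoid the separate start-of-string and end-of-string checks is to use lookarounds instead of capturing the delimiters. The sketch below is only an illustration: it assumes that a "word" may contain apostrophes and hyphens in addition to the ASCII characters matched by \w, and the names WordRemover and removeWord are made up for this example.

```java
import java.util.regex.Pattern;

class WordRemover {
    // Assumption: apostrophes and hyphens count as part of a word,
    // in addition to the ASCII word characters matched by \w.
    private static final String WORD_CHARS = "[\\w'-]";

    // Removes every occurrence of 'word' that is not directly preceded
    // or followed by one of the characters in WORD_CHARS.
    static String removeWord(String phrase, String word) {
        String regex = "(?<!" + WORD_CHARS + ")"
                + Pattern.quote(word)
                + "(?!" + WORD_CHARS + ")";
        return phrase.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        System.out.println(removeWord("bob has a bike and bob's bike is red", "bob"));
        System.out.println(removeWord("the new mail is e-mail", "mail"));
    }
}
```

With the two phrases from the question this leaves bob's and e-mail untouched. The lookarounds consume nothing, so matches at the very start or end of the string need no special handling, and the whitespace around a removed word is left as it was.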
Update: If you insist on using \b, and considering that the "word boundary" understands Unicode, you can also do this dirty trick: replace all occurrences of ' with some Unicode letter that you are sure will not appear in your text, and afterwards do the reverse replacement. Example:
String phrase="bob has a bike bob, bob and boba bob's bike is red and \"bob\" stuff.";
String word="bob";
phrase= phrase.replace("'","ñ").replace('"','ö');
phrase=phrase.replaceAll("\\b" + Pattern.quote(word) + "\\b","");
phrase= phrase.replace('ö','"').replace("ñ","'");
System.out.println(phrase);
UPDATE: To summarize some comments below: one would expect \w and \b to have the same notion of what a "word character" is, as almost every regular-expression dialect does. Well, Java does not: \w considers ASCII, \b considers Unicode. It's an ugly inconsistency, I agree.
Update 2: Since Java 7 (as pointed out in the comments), the UNICODE_CHARACTER_CLASS flag lets you request consistent, Unicode-aware behaviour.
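A small sketch of what the flag changes, assuming Java 7 or later. The class and method names (UnicodeWordDemo, isWordChar, findsBob) are invented for illustration, and the example only checks \w with and without the flag and \b with the flag set; what \b does without the flag is left out.

```java
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // True if the whole string s is a single \w character under the given flags.
    static boolean isWordChar(String s, int flags) {
        return Pattern.compile("\\w", flags).matcher(s).matches();
    }

    // True if "bob" occurs in s delimited by \b on both sides under the given flags.
    static boolean findsBob(String s, int flags) {
        return Pattern.compile("\\bbob\\b", flags).matcher(s).find();
    }

    public static void main(String[] args) {
        // The letter n-with-tilde (U+00F1) is written as an escape
        // to avoid source-encoding issues.
        String enye = "\u00f1";

        System.out.println(isWordChar(enye, 0));                               // false: \w is ASCII-only by default
        System.out.println(isWordChar(enye, Pattern.UNICODE_CHARACTER_CLASS)); // true
        System.out.println(findsBob("bob" + enye + "s", Pattern.UNICODE_CHARACTER_CLASS)); // false
    }
}
```

With the flag set, both \w and \b treat the letter as a word character, so there is no boundary between "bob" and the placeholder letter.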
\b\S*(bob|mail)\S*\b
Be careful with false positives, this could match more than you want. If you need "prefixes" or "suffixes" of no more than 2 characters (things like "'s" or "e-"), use \S{0,2} instead of \S*.
The regex says:
\b # a word boundary
\S* # any number of non-spaces
( # match group 1 (to enable a choice)
bob|mail # "bob" or "mail"
) # end match group 1
\S* # any number of non-spaces
\b # a word boundary
So, in Java:
phrase = phrase.replaceAll("\\b\\S*(bob|mail)\\S*\\b", "");
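To see what such a pattern actually removes, here is a quick sketch using the \S{0,2} variant on the two phrases from the question. The class and method names (TokenRemovalDemo, removeTokens) are made up for the example.

```java
class TokenRemovalDemo {
    // Removes "bob" or "mail" together with up to two adjacent
    // non-space characters on either side.
    static String removeTokens(String phrase) {
        return phrase.replaceAll("\\b\\S{0,2}(bob|mail)\\S{0,2}\\b", "");
    }

    public static void main(String[] args) {
        System.out.println("[" + removeTokens("bob has a bike and bob's bike is red") + "]");
        System.out.println("[" + removeTokens("the new mail is e-mail") + "]");
    }
}
```

Note that the short prefix or suffix is removed along with the word, so both "bob" and "bob's" (and both "mail" and "e-mail") disappear. That is worth checking against what you actually want to keep.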
Be careful with things like
phrase = phrase.replaceAll("\\b" + word + "\\b", "");
That should be
phrase = phrase.replaceAll("\\b" + Pattern.quote(word) + "\\b", "");
since whenever word contains regex metacharacters, your regex will break unless you escape the string beforehand with Pattern.quote().
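A minimal illustration of the difference, using a hypothetical word "a.b" whose dot is a regex metacharacter. The class and method names (QuoteDemo, removeUnquoted, removeQuoted) are invented for the example.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // The word is spliced into the regex as-is, so "." means "any character".
    static String removeUnquoted(String phrase, String word) {
        return phrase.replaceAll("\\b" + word + "\\b", "");
    }

    // Pattern.quote() makes the word match only as a literal string.
    static String removeQuoted(String phrase, String word) {
        return phrase.replaceAll("\\b" + Pattern.quote(word) + "\\b", "");
    }

    public static void main(String[] args) {
        String phrase = "axb a.b";
        System.out.println("[" + removeUnquoted(phrase, "a.b") + "]"); // [ ]     both tokens removed
        System.out.println("[" + removeQuoted(phrase, "a.b") + "]");   // [axb ]  only the literal a.b removed
    }
}
```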