Remove all non-"word characters" from a String in Java, leaving accented characters?
Apparently Java's Regex flavor counts Umlauts and other special characters as non-"word characters" when I use Regex.
"TESTÜTEST".replaceAll( "\\W", "" )
returns "TESTTEST" for me. What I want开发者_如何学JAVA is for only all truly non-"word characters" to be removed. Any way to do this without having something along the lines of
"[^A-Za-z0-9äöüÄÖÜßéèáàúùóò]"
only to realize I forgot ô?
Use [^\p{L}\p{Nd}]+
- this matches all (Unicode) characters that are neither letters nor (decimal) digits.
In Java:
String resultString = subjectString.replaceAll("[^\\p{L}\\p{Nd}]+", "");
Edit:
I changed \p{N}
to \p{Nd}
because the former also matches some number symbols like ¼
; the latter doesn't. See it on regex101.com.
I was trying to achieve the exact opposite when I bumped on this thread. I know it's quite old, but here's my solution nonetheless. You can use blocks, see here. In this case, compile the following code (with the right imports):
> String s = "äêìóblah";
> Pattern p = Pattern.compile("[\\p{InLatin-1Supplement}]+"); // this regex uses a block
> Matcher m = p.matcher(s);
> System.out.println(m.find());
> System.out.println(s.replaceAll(p.pattern(), "#"));
You should see the following output:
true
#blah
Best,
At times you do not want to simply remove the characters, but just remove the accents. I came up with the following utility class which I use in my Java REST web projects whenever I need to include a String in an URL:
import java.text.Normalizer;
import java.text.Normalizer.Form;
import org.apache.commons.lang.StringUtils;
/**
* Utility class for String manipulation.
*
* @author Stefan Haberl
*/
public abstract class TextUtils {
private static String[] searchList = { "Ä", "ä", "Ö", "ö", "Ü", "ü", "ß" };
private static String[] replaceList = { "Ae", "ae", "Oe", "oe", "Ue", "ue",
"sz" };
/**
* Normalizes a String by removing all accents to original 127 US-ASCII
* characters. This method handles German umlauts and "sharp-s" correctly
*
* @param s
* The String to normalize
* @return The normalized String
*/
public static String normalize(String s) {
if (s == null)
return null;
String n = null;
n = StringUtils.replaceEachRepeatedly(s, searchList, replaceList);
n = Normalizer.normalize(n, Form.NFD).replaceAll("[^\\p{ASCII}]", "");
return n;
}
/**
* Returns a clean representation of a String which might be used safely
* within an URL. Slugs are a more human friendly form of URL encoding a
* String.
* <p>
* The method first normalizes a String, then converts it to lowercase and
* removes ASCII characters, which might be problematic in URLs:
* <ul>
* <li>all whitespaces
* <li>dots ('.')
* <li>(semi-)colons (';' and ':')
* <li>equals ('=')
* <li>ampersands ('&')
* <li>slashes ('/')
* <li>angle brackets ('<' and '>')
* </ul>
*
* @param s
* The String to slugify
* @return The slugified String
* @see #normalize(String)
*/
public static String slugify(String s) {
if (s == null)
return null;
String n = normalize(s);
n = StringUtils.lowerCase(n);
n = n.replaceAll("[\\s.:;&=<>/]", "");
return n;
}
}
Being a German speaker I've included proper handling of German umlauts as well - the list should be easy to extend for other languages.
HTH
EDIT: Note that it may be unsafe to include the returned String in an URL. You should at least HTML encode it to prevent XSS attacks.
Well, here is one solution I ended up with, but I hope there's a more elegant one...
StringBuilder result = new StringBuilder();
for(int i=0; i<name.length(); i++) {
char tmpChar = name.charAt( i );
if (Character.isLetterOrDigit( tmpChar) || tmpChar == '_' ) {
result.append( tmpChar );
}
}
result
ends up with the desired result...
You might want to remove the accents and diacritic signs first, then on each character position check if the "simplified" string is an ascii letter - if it is, the original position shall contain word characters, if not, it can be removed.
精彩评论