How do I detect unicode characters in a Java string?
Suppose I have a string that contains Ü. How would I find all those unicode characters? Should I test for their code? How would I do that?
For example, given the string "AÜXÜ", I'd like to transform it to "AYXY". I'd like to do the same for other unicode characters, and I would hate to have to store them in a translation map of some sort.
You could loop through your string and, for every character c, check:
if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
// replace with Y
}
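To make that concrete, here is a minimal sketch of the whole loop. The class and method names (BasicLatinFilter, replaceNonBasicLatin) are made up for illustration, and the test string is written with \u escapes so it does not depend on the source file encoding.

```java
final class BasicLatinFilter {

    /** Replaces every char outside the Basic Latin block (U+0000..U+007F) with 'Y'. */
    static String replaceNonBasicLatin(String input) {
        StringBuilder result = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
                result.append('Y'); // anything that is not plain ASCII
            } else {
                result.append(c);   // keep Basic Latin characters unchanged
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // "\u00DC" is Ü, so the input is "AÜXÜ".
        System.out.println(replaceNonBasicLatin("A\u00DCX\u00DC")); // prints AYXY
    }
}
```

Note that this works char by char, so a character outside the Basic Multilingual Plane (stored as a surrogate pair) would turn into two Ys.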
The definition of "unicode characters" is vague (every character in a Java String is a Unicode character), but I will take it to mean characters not covered by the standard ISO 8859 charset. If this is true in your case, then loop through all characters in the String and test each one's code point to determine whether it is within the given character set.
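One way to sketch that test, assuming "ISO 8859" here means ISO 8859-1 (Latin-1), is to ask a CharsetEncoder whether it can encode each character. The class and method names below are hypothetical. Under this definition Ü (U+00DC) is itself inside the charset, so it would not be replaced.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class Latin1Check {

    /**
     * Replaces every char that ISO 8859-1 cannot encode with the given replacement.
     * Assumption: ISO 8859-1 is the charset of interest; swap in another Charset if not.
     */
    static String replaceNonLatin1(String input, char replacement) {
        CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder result = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            result.append(latin1.canEncode(c) ? c : replacement);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // "\u00DC" is Ü (inside Latin-1), "\u20AC" is the euro sign (outside Latin-1).
        String replaced = replaceNonLatin1("A\u00DC\u20AC", 'Y');
        System.out.println(replaced.equals("A\u00DCY")); // prints true
    }
}
```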
Alternatively, use a Map<Character, Character> and replace the characters in the string that match its keys. For example:
// Requires: import java.util.HashMap; import java.util.Map;
Map<Character, Character> charReplacementMap = new HashMap<>();
charReplacementMap.put('Ü', 'Y');
// Put more here.
String originalString = "AÜAÜ";
StringBuilder builder = new StringBuilder();
for (char currentChar : originalString.toCharArray()) {
Character replacementChar = charReplacementMap.get(currentChar);
builder.append(replacementChar != null ? replacementChar : currentChar);
}
String newString = builder.toString();
Or, do you mean "all characters with diacritics"? If so, then use java.text.Normalizer to remove diacritical marks:
/**
* Remove any diacritical marks (accents like ç, ñ, é, etc) from
* the given string (so that it returns plain c, n, e, etc).
* @param string The string to remove diacritical marks from.
* @return The string with removed diacritical marks, if any.
*/
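// Requires: import java.text.Normalizer;
public static String removeDiacriticalMarks(String string) {
    // NFD splits each accented character into base letter + combining marks,
    // then the regex strips the combining marks.
    return Normalizer.normalize(string, Normalizer.Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}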
One pitfall: Ü would become U, not Y. I'm not sure if that's what you're after. If you want to replace each character by the way it is pronounced, you'll really need to create a mapping. Sure, it's tedious work, but it takes less time than you needed to follow this topic.
You could go the other way round and ask whether the character is an ASCII character.
public static boolean isAscii(char ch) {
return ch < 128;
}
You'd have to analyse the string char by char then, of course.
(The method is from commons-lang CharUtils, which contains loads of useful Character methods.)
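If the goal is to find the characters rather than replace them, a char-by-char scan with that check could look like the sketch below. It reimplements isAscii locally so it runs without commons-lang; NonAsciiFinder and nonAsciiIndexes are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

final class NonAsciiFinder {

    /** Same check as above: true for the 7-bit ASCII range. */
    static boolean isAscii(char ch) {
        return ch < 128;
    }

    /** Returns the index of every char in the input that is not ASCII. */
    static List<Integer> nonAsciiIndexes(String input) {
        List<Integer> indexes = new ArrayList<>();
        for (int i = 0; i < input.length(); i++) {
            if (!isAscii(input.charAt(i))) {
                indexes.add(i);
            }
        }
        return indexes;
    }

    public static void main(String[] args) {
        // "\u00DC" is Ü, so the input is "AÜXÜ".
        System.out.println(nonAsciiIndexes("A\u00DCX\u00DC")); // prints [1, 3]
    }
}
```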
It isn't clear to me exactly what is gained by transforming "AÜXÜ" to "AYXY". Is this because Ü is pronounced like Y in a particular language? What language? And what other rules might apply?
In terms of terminology...
"a"
The above is a Unicode string. It contains a single UTF-16 encoded character.
If you wish to limit the range of characters to the English alphabet, have a look at the Normalization performed in this answer.
I'm not sure from your example what you're trying to do. If you're just trying to replace all non-ASCII values with Y, then you could loop through the string looking for code points outside the range 0 to 127 and replace those code points with Y.
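A rough sketch of that loop, assuming Java 8 or later for String.codePoints(); the names NonAsciiReplacer and replaceNonAscii are only for illustration. Iterating over code points rather than chars means a supplementary character (a surrogate pair) becomes a single replacement.

```java
final class NonAsciiReplacer {

    /** Replaces every code point outside 0..127 with the given replacement char. */
    static String replaceNonAscii(String input, char replacement) {
        StringBuilder result = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (cp <= 127) {
                result.appendCodePoint(cp); // ASCII: keep as is
            } else {
                result.append(replacement); // one replacement per code point
            }
        });
        return result.toString();
    }

    public static void main(String[] args) {
        // "\u00DC" is Ü, so the input is "AÜXÜ".
        System.out.println(replaceNonAscii("A\u00DCX\u00DC", 'Y')); // prints AYXY
        // "\uD83D\uDE00" is one code point (U+1F600) stored as two chars.
        System.out.println(replaceNonAscii("a\uD83D\uDE00b", 'Y')); // prints aYb
    }
}
```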
The class Character also offers some interesting methods. Take a look at it.
Character.UnicodeBlock.of('a') == Character.UnicodeBlock.BASIC_LATIN; //true
Character.UnicodeBlock.of('Ü') == Character.UnicodeBlock.BASIC_LATIN; //false