Removing accent marks (diacritics) from Latin characters for comparison [duplicate]
I need to compare the names of European places that are written using the L开发者_StackOverflow社区atin alphabet with accent marks (diacritics) on some characters. There are lots of Central and Eastern European names that are written with accent marks like Latin characters on ž and ü, but some people write the names just using the regular Latin characters without accent marks like z and u.
I need a way to have my system recognize for example mšk žilina being the same as msk zilina, and similar for all the other accented characters used. Is there a simple way to do this?
You can make use of java.text.Normalizer and a little regex to get rid of the diacritical marks.
public static String removeDiacriticalMarks(String string) {
return Normalizer.normalize(string, Form.NFD)
.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}
Usage example:
String text = "mšk žilina";
String normalized = removeDiacriticalMarks(text);
System.out.println(normalized); // msk zilina
加载中,请稍侯......
精彩评论