Removing accent marks (diacritics) from Latin characters for comparison [duplicate]
I need to compare the names of European places that are written in the Latin alphabet with accent marks (diacritics) on some characters. Many Central and Eastern European names contain accented Latin characters such as ž and ü, but some people write them with the plain Latin characters instead, such as z and u.
I need my system to recognize, for example, mšk žilina as being the same as msk zilina, and likewise for all the other accented characters in use. Is there a simple way to do this?
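You can use java.text.Normalizer and a little regex to get rid of the diacritical marks. Normalizing to form NFD decomposes each accented character into its base letter followed by separate combining marks, and the regex then removes those marks.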
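import java.text.Normalizer;
import java.text.Normalizer.Form;

public static String removeDiacriticalMarks(String string) {
    return Normalizer.normalize(string, Form.NFD)
            // drop the combining marks that remain after decomposition
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}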
Usage example:
String text = "mšk žilina";
String normalized = removeDiacriticalMarks(text);
System.out.println(normalized); // msk zilina
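Since the goal is comparison rather than display, the same idea can be wrapped in a small helper that also ignores case. This is only a sketch: the class PlaceNames and the methods stripDiacritics, comparisonKey and sameName are hypothetical names chosen for illustration, and lowercasing with Locale.ROOT is an assumption about how the names should be matched.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper class; the names are illustrative, not part of any library.
final class PlaceNames {

    // Decompose to NFD, then remove the combining marks.
    static String stripDiacritics(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Key used for comparison: diacritics removed, lower case (assumed locale-neutral).
    static String comparisonKey(String s) {
        return stripDiacritics(s).toLowerCase(Locale.ROOT);
    }

    // Two names match when their comparison keys are equal.
    static boolean sameName(String a, String b) {
        return comparisonKey(a).equals(comparisonKey(b));
    }

    public static void main(String[] args) {
        System.out.println(sameName("MŠK Žilina", "msk zilina")); // true
        System.out.println(sameName("Žilina", "Zvolen"));         // false
        System.out.println(stripDiacritics("Łódź"));              // Łodz
    }
}
```

Note the last line: letters such as Ł or ø have no canonical decomposition, so NFD leaves them unchanged. If names containing them must also match their plain spellings, they would need an explicit replacement step in addition to the normalization.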