Removing diacritics in Polish
I'm trying to remove diacritics from a pangram in Polish. I'm using the code from Michael Kaplan's blog (http://www.siao2.com/2007/05/14/2629747.aspx), but with no success.
Consider the following pangram: "Pchnąć w tę łódź jeża lub ośm skrzyń fig.". Everything works fine except for the letter "ł": I still get "ł". I guess the problem is that "ł" is represented as a single Unicode character, with no NonSpacingMark following it.
Do you have any idea how I can fix it (without relying on a custom mapping in some dictionary; I'm looking for some kind of Unicode conversion)?
Some time ago I came across this solution, which seems to work fine:
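// Requires: using System.Text;
// Assumed to rely on the encoder's best-fit fallback, which swaps letters the
// target code page lacks (such as "ą" or "ł") for their closest plain letters.
public static string RemoveDiacritics(this string s)
{
    string asciiEquivalents = Encoding.ASCII.GetString(
        Encoding.GetEncoding("Cyrillic").GetBytes(s)
    );
    return asciiEquivalents;
}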
Here is my quick implementation of a Polish stop list with normalization of Polish diacritics.
import java.util.HashSet;

class StopList
{
    private HashSet<String> set = new HashSet<String>();

    // Stores the word trimmed, lower-cased and stripped of Polish diacritics.
    public void add(String word)
    {
        word = word.trim().toLowerCase();
        word = normalize(word);
        set.add(word);
    }

    public boolean contains(final String string)
    {
        return set.contains(string) || set.contains(normalize(string));
    }

    // Maps a lower-case Polish letter to its plain Latin counterpart.
    private char normalizeChar(final char c)
    {
        switch (c)
        {
            case 'ą':
                return 'a';
            case 'ć':
                return 'c';
            case 'ę':
                return 'e';
            case 'ł':
                return 'l';
            case 'ń':
                return 'n';
            case 'ó':
                return 'o';
            case 'ś':
                return 's';
            case 'ż':
            case 'ź':
                return 'z';
        }
        return c;
    }

    private String normalize(final String word)
    {
        if (word == null || "".equals(word))
        {
            return word;
        }
        char[] charArray = word.toCharArray();
        char[] normalizedArray = new char[charArray.length];
        for (int i = 0; i < normalizedArray.length; i++)
        {
            normalizedArray[i] = normalizeChar(charArray[i]);
        }
        return new String(normalizedArray);
    }
}
I couldn't find any other solution on the net, so maybe this will be helpful to someone.
The approach taken in the article is to remove Mark, Nonspacing characters. Since, as you correctly point out, "ł" is not composed of two characters (one of which is a Mark, Nonspacing), the behavior you see is expected.
I don't think that the structure of Unicode allows you to accomplish a fully automated remapping (the author of the article you reference reaches the same conclusion).
If you're just interested in Polish characters, at least the mapping is small and well-defined (see e.g. the bottom of http://www.biega.com/special-char.html). For the general case, I do not think an automated solution exists for characters that are not composed of a standard character plus a Mark, Nonspacing character.
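For what it's worth, here is a minimal Java sketch of that idea: let canonical decomposition handle everything it can, and keep an explicit replacement only for the one Polish letter it cannot. The class and method names (`PolishDiacritics`, `strip`) are made up for illustration, and the letters are written as Unicode escapes only so the source stays ASCII.

```java
import java.text.Normalizer;

class PolishDiacritics {
    // Hypothetical helper, not taken from any library.
    static String strip(String input) {
        if (input == null) {
            return null;
        }
        // NFD splits precomposed letters into base letter + combining mark,
        // e.g. U+0105 (a with ogonek) becomes "a" followed by U+0328.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{Mn} matches characters in the Mark, Nonspacing category.
        String withoutMarks = decomposed.replaceAll("\\p{Mn}+", "");
        // U+0142 / U+0141 (l / L with stroke) have no decomposition,
        // so they are the only letters mapped by hand here.
        return withoutMarks.replace('\u0142', 'l').replace('\u0141', 'L');
    }

    public static void main(String[] args) {
        // U+0141 U+00F3 d U+017A is the city name "Lodz" in Polish spelling.
        System.out.println(strip("\u0141\u00f3d\u017a")); // Lodz
    }
}
```

Applied to the pangram from the question, this should give "Pchnac w te lodz jeza lub osm skrzyn fig.". The hand-written part covers only "ł" and "Ł"; other letters without a decomposition (such as "ø" or "ß") would need their own entries.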
It is in the Unicode chart, code point \u0142. Scroll down to the description, "Latin small letter l with stroke"; it has no decomposition listed. I don't know anything about Polish, but it is common for a letter to have a distinguishing mark that makes it its own letter instead of a base one with a diacritic.
You'll have to replace these manually (just like with ÆÐØÞßæðøþ in Latin-1).
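If you want to check this for a given letter yourself, a small Java sketch like the following can help. It only uses `java.text.Normalizer`; the class and method names are hypothetical.

```java
import java.text.Normalizer;

class DecompositionCheck {
    // Hypothetical helper: true if canonical decomposition (NFD) changes the
    // string, i.e. at least one character has a decomposition.
    static boolean decomposes(String s) {
        return !Normalizer.normalize(s, Normalizer.Form.NFD).equals(s);
    }

    // Hypothetical helper: lists the code points of the NFD form, e.g. "U+0061 U+0328".
    static String nfdCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        Normalizer.normalize(s, Normalizer.Form.NFD)
                .codePoints()
                .forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(nfdCodePoints("\u0105")); // U+0061 U+0328
        System.out.println(nfdCodePoints("\u0142")); // U+0142
    }
}
```

The first line shows "ą" splitting into "a" plus a combining ogonek, while "ł" comes back as the same single code point, which is why mark-stripping code never touches it.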
Other people have had the same problem, so the Unicode Common Locale Data Repository has "Agreed to add a transliterator that does accent removal, even for overlaid accents." (Ticket #2884)
There are quite a few precomposed characters that have no meaningful decompositions.
(There are also a handful that could have reasonable decompositions but are prohibited from such decomposition in most normalisation forms, as it would lead to differences between versions, which would make them not really normalisation any more.)
ł is one of these. IIRC it's also not possible to give a culture-neutral transcription to alphabets that don't use ł. I think Germans tend to transcribe it to w rather than l (or maybe it's someone else who does), which makes sense (it's not quite the right sound either, but it's closer than l).
I propose this. It works perfectly.
private static Dictionary<string, string> NormalizeTable()
{
    return new Dictionary<string, string>()
    {
        {"ą", "a"},
        {"ć", "c"},
        {"ę", "e"},
        {"ł", "l"},
        {"ń", "n"},
        {"ó", "o"},
        {"ś", "s"},
        {"ź", "z"},
        {"ż", "z"},
    };
}

public static string Normalize(string original)
{
    if (original == null) return null;
    var lower = original.ToLower();
    var dictionary = NormalizeTable();
    foreach (var (key, value) in dictionary)
    {
        lower = lower.Replace(key, value);
    }
    return lower;
}
public static string ReplacePolishSigns(this string input)
    => input.Replace("ą", "a")
        .Replace("ć", "c")
        .Replace("ę", "e")
        .Replace("ł", "l")
        .Replace("ń", "n")
        .Replace("ó", "o")
        .Replace("ś", "s")
        .Replace("ż", "z")
        .Replace("ź", "z");
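I found a solution that is supposed to handle 'ł' as well (note that 'ł' has no canonical decomposition, so it may still need an explicit replacement after normalization):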
string RemoveDiacritics(string text)
{
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();
    foreach (var c in normalizedString)
    {
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }
    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}