Why doesn't Đ get flattened to D when removing accents/diacritics?
I'm using this method to remove accents from my strings:
// Requires: using System.Globalization; using System.Text;
static string RemoveAccents(string input)
{
    string normalized = input.Normalize(NormalizationForm.FormKD);
    StringBuilder builder = new StringBuilder();
    foreach (char c in normalized)
    {
        // Keep everything except the combining marks produced by the decomposition.
        if (char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        {
            builder.Append(c);
        }
    }
    return builder.ToString();
}
But this method leaves đ as đ and doesn't change it to d, even though d is its base char. You can try it with this input string: "æøåáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ"
What's so special about the letter đ?
The answer for why it doesn't work is that the statement that "d is its base char" is false. U+0111 (LATIN SMALL LETTER D WITH STROKE) has Unicode category "Letter, Lowercase" and has no decomposition mapping (i.e., it doesn't decompose to "d" followed by a combining mark). "đ".Normalize(NormalizationForm.FormD) simply returns "đ", which is not stripped out by the loop because it is not a non-spacing mark.
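To see the difference for a given character, it can help to print the code points that FormD actually produces. The following is a minimal sketch; the class and method names (DecompositionProbe, Describe, HasCombiningMarks) are made up for illustration and are not part of .NET.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class DecompositionProbe
{
    // Lists the UTF-16 code units of the canonical decomposition (FormD),
    // formatted as "U+XXXX" and separated by spaces.
    public static string Describe(string s)
    {
        string decomposed = s.Normalize(NormalizationForm.FormD);
        return string.Join(" ", decomposed.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    // True if FormD yields at least one non-spacing (combining) mark,
    // i.e. the accent-stripping loop has something to remove.
    public static bool HasCombiningMarks(string s)
    {
        return s.Normalize(NormalizationForm.FormD)
                .Any(c => CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark);
    }
}
```

With this, Describe("\u00E9") (é) gives "U+0065 U+0301", a plain e followed by a combining acute accent, while Describe("\u0111") (đ) gives just "U+0111", so there is no mark for the loop to drop.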
A similar issue will exist for "ø" and other letters for which Unicode provides no decomposition mapping. (And if you're trying to find the "best" ASCII character to represent a Unicode letter, this approach won't work at all for Cyrillic, Greek, Chinese or other non-Latin alphabets; you'll also run into problems if you wanted to transliterate "ß" into "ss", for example. Using a library like UnidecodeSharp may help.)
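If you only need a handful of such letters, one option is to keep the normalization loop and add an explicit fallback table for letters that have no decomposition. The sketch below assumes a small hand-written table; the names AccentFolding, Fallbacks and Fold are hypothetical, and the mappings are illustrative choices rather than anything defined by Unicode or .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class AccentFolding
{
    // Hypothetical, deliberately incomplete table: extend it for the letters you care about.
    private static readonly Dictionary<char, string> Fallbacks = new Dictionary<char, string>
    {
        { '\u0111', "d" },  // đ
        { '\u0110', "D" },  // Đ
        { '\u00F8', "o" },  // ø
        { '\u00D8', "O" },  // Ø
        { '\u00E6', "ae" }, // æ
        { '\u00C6', "AE" }, // Æ
        { '\u00DF', "ss" }, // ß
        { '\u0142', "l" },  // ł
        { '\u0141', "L" },  // Ł
    };

    public static string Fold(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return input;
        }

        var builder = new StringBuilder(input.Length);
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
            {
                continue; // drop combining marks produced by the decomposition
            }

            // Letters without a decomposition go through the table; everything else is kept.
            if (Fallbacks.TryGetValue(c, out string replacement))
            {
                builder.Append(replacement);
            }
            else
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}
```

For example, Fold("Stra\u00DFe") returns "Strasse" and Fold("\u00E9\u0111\u00F8") returns "edo". Because the replacements are strings, one-to-many mappings such as ß to "ss" are possible, which a char-to-char approach cannot express.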
I have to admit that I'm not sure why this works, but it sure seems to:
var str = "æøåáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ";
var noApostrophes = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(str));
// => "aoaaaaalccceeeeiiddnnooooruuuuyt"
"D with stroke" (Wikipedia) is used in several languages, and appears to be considered a distinct letter in all of them -- and that is why it remains unchanged.
string.Normalize(NormalizationForm) is an easy way to remove 'real' diacritics (Wiki), but many letters you may want to convert are not affected by it.
I had similar problems with Ð & ð (letter Eth), đ, and Æ & æ. To convert them into ANSI (Latin) characters, use an encoding conversion instead:
private static char[] ConvertUnicodeStringToSpecificEncoding(string input, int resultEncodingCode)
{
    System.Text.Encoding unicodeEncoding = System.Text.Encoding.Unicode;
    System.Text.Encoding specificEncoding = System.Text.Encoding.GetEncoding(resultEncodingCode);
    byte[] convertedBytes = System.Text.Encoding.Convert(unicodeEncoding, specificEncoding, unicodeEncoding.GetBytes(input));
    char[] convertedChars = new char[specificEncoding.GetCharCount(convertedBytes, 0, convertedBytes.Length)];
    specificEncoding.GetChars(convertedBytes, 0, convertedBytes.Length, convertedChars, 0);
    return convertedChars;
}
Call this method with several encodings in turn on the same string; each pass replaces the letters that the target code page cannot represent, so what is left is the intersection of the letters all of them support.
List of encodings: https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding?view=netframework-4.8
My solution looks like this:
// Requires: using System.Linq; using System.Text;
// The extension method below must be declared inside a static class.
// On .NET Core / .NET 5+, the DOS code pages used here are only available after
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) has been called.
// Encoding Types (int Codes) https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding?view=netframework-4.8
private static readonly char[] charactersToSkip = new char[] { 'ä', 'ö', 'ü', 'Ä', 'Ö', 'Ü' };
private static readonly char[] specialCharsToSkip = new char[] { '^', '´', '`', '°', '!', '\'', '§', '$', '%', '&', '/', '(', ')', '=', '{', '[', ']', '}', '\\', '+', '-' };
private static readonly char[] ambiguousCharsToSkip = new char[] { '?' }; // Chars which might be a result of encoding-conversion and have to be skipped beforehand.
private static readonly int[] encodingsToRemoveDiacritics = new int[]
{
    852, // 852 ibm852 Central European (DOS)
    850, // 850 ibm850 Western European (DOS)
    860, // 860 IBM860 Portuguese (DOS)
    /* Warning:
     * Only append encodings.
     * Changing sort order of encodings may result in malfunctioning.
     */
};
public static string RemoveDiacritics(this string inputString)
{
    if (string.IsNullOrEmpty(inputString))
    {
        return inputString;
    }
    var resultStringBuilder = new StringBuilder();
    foreach (char currentChar in inputString)
    {
        // Characters on the skip lists are copied through unchanged.
        if (charactersToSkip.Contains(currentChar) || specialCharsToSkip.Contains(currentChar) || ambiguousCharsToSkip.Contains(currentChar))
        {
            resultStringBuilder.Append(currentChar);
            continue;
        }
        string normalizedString = currentChar.ToString().Normalize(NormalizationForm.FormD);
        foreach (char normalizedChar in normalizedString)
        {
            if (System.Globalization.CharUnicodeInfo.GetUnicodeCategory(normalizedChar) != System.Globalization.UnicodeCategory.NonSpacingMark)
            {
                string convertedString = normalizedChar.ToString();
                char[] convertedChars = null;
                foreach (int encodingCode in encodingsToRemoveDiacritics)
                {
                    convertedChars = ConvertUnicodeStringToSpecificEncoding(convertedString, encodingCode);
                    // A '?' means the code page had no replacement; keep the previous result in that case.
                    if (!convertedChars.Contains('?'))
                    {
                        convertedString = new string(convertedChars);
                    }
                }
                resultStringBuilder.Append(convertedString);
            }
        }
    }
    return resultStringBuilder.ToString();
}
which produces the following outputs:
"abcdefghijklmnopqrstuvwxzy" -> "abcdefghijklmnopqrstuvwxzy"
"ABCDEFGHIJKLMNOPQRSTUVWXYZ" -> "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
"1234567890" -> "1234567890"
"ß" -> "ß"
"ÄÖÜ" -> "ÄÖÜ"
"äöü" -> "äöü"
"!\"§$%&/()=?" -> "!\"§$%&/()=?"
"+-_~'*#" -> "+-_~'*#"
",.;:" -> ",.;:"
"µ" -> "u" // µ (mu) -> u
"<>|" -> "<>|"
"´`^°" -> "´`^°"
"²" -> "2" // ² -> 2
"³" -> "3" // ³ -> 3
"{}" -> "{}"
"[]" -> "[]"
"\\" -> "\\"
"áàâã" -> "aaaa"
"ÁÀÂÅ" -> "AAAA"
"éèêę" -> "eeee"
"ÉÈÊĚ" -> "EEEE"
"íìîï" -> "iiii"
"ÍÌÎ" -> "III"
"óòôõ" -> "oooo"
"ÓÒÔŌ" -> "OOOO"
"úùû" -> "uuu"
"ÚÙÛ" -> "UUU"
"ÇĆĈČĊ" -> "CCCCC"
"çćĉčċ" -> "ccccc"
"Ñ" -> "N"
"Æ" -> "A"
"æ" -> "a"
"ýÿ" -> "yy"
"ĹĻĽ" -> "LLL"
"Ð" -> "D"
"đ" -> "d"
"ð" -> "d"
This should work:
private static String RemoveDiacritics(string text)
{
    String normalized = text.Normalize(NormalizationForm.FormD);
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < normalized.Length; i++)
    {
        Char c = normalized[i];
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            sb.Append(c);
    }
    return sb.ToString();
}