Removing diacritics in Polish

2023-01-12 18:54

I'm trying to remove diacritic characters from a pangram in Polish. I'm using the code from Michael Kaplan's blog (http://www.siao2.com/2007/05/14/2629747.aspx), but with no success.

Consider the following pangram: "Pchnąć w tę łódź jeża lub ośm skrzyń fig.". Everything works fine except for the letter "ł", which stays "ł". I guess the problem is that "ł" is represented as a single Unicode character, with no following NonSpacingMark.

Do you have any idea how I can fix it without relying on a custom mapping in some dictionary? I'm looking for some kind of Unicode conversion.

Some time ago I came across this solution, which seems to work fine:

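    // Requires: using System.Text;
    // Encoding to a legacy Cyrillic code page makes the encoder fall back to
    // best-fit ASCII look-alikes for Latin letters that the code page lacks.
    public static string RemoveDiacritics(this string s)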
    {
        string asciiEquivalents = Encoding.ASCII.GetString(
                     Encoding.GetEncoding("Cyrillic").GetBytes(s)
                 );

        return asciiEquivalents;
    }

Here is my quick implementation of a Polish stop list that normalizes Polish diacritics.

import java.util.HashSet;

class StopList
{
    private HashSet<String> set = new HashSet<String>();

    public void add(String word)
    {
        word = word.trim().toLowerCase();
        word = normalize(word);
        set.add(word);

    }

    public boolean contains(final String string)
    {
        return set.contains(string) || set.contains(normalize(string));
    }

    private char normalizeChar(final char c)
    {
        switch ( c)
        {
            case 'ą':
                return 'a';
            case 'ć':
                return 'c';
            case 'ę':
                return 'e';
            case 'ł':
                return 'l';
            case 'ń':
                return 'n';
            case 'ó':
                return 'o';
            case 'ś':
                return 's';
            case 'ż':
            case 'ź':
                return 'z';
        }
        return c;
    }

    private String normalize(final String word)
    {
        if (word == null || "".equals(word))
        {
            return word;
        }
        char[] charArray = word.toCharArray();
        char[] normalizedArray = new char[charArray.length];
        for (int i = 0; i < normalizedArray.length; i++)
        {
            normalizedArray[i] = normalizeChar(charArray[i]);
        }
        return new String(normalizedArray);
    }
}

I couldn't find any other solution on the net, so maybe it will be helpful to someone.

The approach taken in the article is to remove Mark, Nonspacing characters. Since, as you correctly point out, "ł" is not composed of two characters (one of which is a Mark, Nonspacing), the behavior you see is expected.
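To make the behavior concrete, here is a minimal Java sketch of the same mark-stripping idea (decompose, then drop Mark, Nonspacing characters). The class and method names are made up for illustration, and it is not the code from the article. The Polish letters are written as `\u` escapes only so that the file compiles regardless of source encoding.

```java
import java.text.Normalizer;

class StripMarksDemo {
    // Decompose to NFD, then drop every character in category Mn (Mark, Nonspacing).
    static String stripNonspacingMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        // "Pchnąć w tę łódź jeża lub ośm skrzyń fig."
        String pangram =
            "Pchn\u0105\u0107 w t\u0119 \u0142\u00f3d\u017a je\u017ca lub o\u015bm skrzy\u0144 fig.";
        System.out.println(stripNonspacingMarks(pangram));
    }
}
```

Every accented letter loses its mark, but "ł" comes through unchanged because there is no combining mark to remove.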

I don't think that the structure of Unicode allows you to accomplish a fully automated remapping (the author of the article you reference reaches the same conclusion).

If you're just interested in Polish characters, at least the mapping is small and well-defined (see e.g. the bottom of http://www.biega.com/special-char.html). For the general case, I do not think an automated solution exists for characters that are not composed of a standard character plus a Mark, Nonspacing character.

It is in the Unicode chart at code point \u0142. Scroll down to the description, "Latin small letter l with stroke": it has no decomposition listed. I don't know anything about Polish, but it is common for a letter to have a distinguishing mark that makes it its own letter rather than a base letter with a diacritic.
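If you would rather check this from code than from the chart, something like the following Java sketch should do. The class and helper names are hypothetical. It treats "NFD leaves the string unchanged" as a stand-in for "no canonical decomposition".

```java
import java.text.Normalizer;

class CodePointInfo {
    // True when NFD leaves the string untouched, i.e. nothing in it decomposes.
    static boolean hasNoDecomposition(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).equals(s);
    }

    public static void main(String[] args) {
        System.out.println(Character.getName(0x0142));    // official character name
        System.out.println(hasNoDecomposition("\u0142")); // ł
        System.out.println(hasNoDecomposition("\u0105")); // ą, which is a + combining ogonek
    }
}
```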

You'll have to replace these manually (just like with ÆÐØÞßæðøþ in Latin-1).
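One way to combine the two ideas is to strip combining marks automatically and keep a small manual table only for the letters that do not decompose. This is a rough Java sketch: the class name, the table contents and the chosen replacements (for example "ss" for "ß") are my own assumptions, not a standard transliteration.

```java
import java.text.Normalizer;
import java.util.Map;

class AsciiFold {
    // Letters with no canonical decomposition; the replacements are illustrative choices.
    private static final Map<Character, String> NO_DECOMPOSITION = Map.of(
        '\u0142', "l",   // ł
        '\u0141', "L",   // Ł
        '\u00f8', "o",   // ø
        '\u00d8', "O",   // Ø
        '\u00e6', "ae",  // æ
        '\u00c6', "AE",  // Æ
        '\u00df', "ss"   // ß
    );

    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (Character.getType(c) == Character.NON_SPACING_MARK) {
                continue; // drop combining marks such as ogonek or acute
            }
            String mapped = NO_DECOMPOSITION.get(c);
            if (mapped != null) {
                out.append(mapped); // manual replacement for a non-decomposing letter
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this, `AsciiFold.fold` turns the pangram from the question into plain ASCII, and the manual table stays limited to the handful of letters Unicode cannot decompose.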

Other people have had the same problem, so the Unicode Common Locale Data Repository has "Agreed to add a transliterator that does accent removal, even for overlaid accents." (Ticket #2884)

There are quite a few precomposed characters that have no meaningful decompositions.

(There are also a handful that could have reasonable decompositions but are prohibited from such decomposition in most normalisation forms, as it would lead to differences between versions, which would make them not really normalisation any more.)

ł is one of these. IIRC it's also not possible to give a culture-neutral transcription for alphabets that don't use ł. I think Germans tend to transcribe it as w rather than l (or maybe it's someone else who does), which makes sense (it's not quite the right sound either, but it's closer than l).

I propose this. It works perfectly.

private static Dictionary<string, string> NormalizeTable()
{
    return new Dictionary<string, string>()
    {
        {"ą", "a"},
        {"ć", "c"},
        {"ę", "e"},
        {"ł", "l"},
        {"ń", "n"},
        {"ó", "o"},
        {"ś", "s"},
        {"ź", "z"},
        {"ż", "z"},
    };
}

public static string Normalize(string original)
{
    if (original == null) return null;
    var lower = original.ToLower();
    var dictionary = NormalizeTable();
    foreach (var (key, value) in dictionary)
    {
        lower = lower.Replace(key, value);
    }
    return lower;
}

public static string ReplacePolishSigns(this string input) 
        => input.Replace("ą", "a")
            .Replace("ć", "c")
            .Replace("ę", "e")
            .Replace("ł", "l")
            .Replace("ń", "n")
            .Replace("ó", "o")
            .Replace("ś", "s")
            .Replace("ż", "z")
            .Replace("ź", "z");
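Note that the replacement chain above only covers lowercase letters, and the dictionary version lowercases the whole input first. If you need to keep the original casing, a single pass over a lookup table might look like this Java sketch. The class name and the parallel `FROM`/`TO` strings are hypothetical, and it assumes precomposed (NFC) input.

```java
class PolishFold {
    // Parallel strings: the character at index i of FROM maps to index i of TO.
    private static final String FROM =
        "\u0105\u0107\u0119\u0142\u0144\u00f3\u015b\u017a\u017c"   // ąćęłńóśźż
      + "\u0104\u0106\u0118\u0141\u0143\u00d3\u015a\u0179\u017b";  // ĄĆĘŁŃÓŚŹŻ
    private static final String TO = "acelnoszz" + "ACELNOSZZ";

    static String replacePolishSigns(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            int idx = FROM.indexOf(c);
            // Keep every character that is not in the table as-is.
            out.append(idx >= 0 ? TO.charAt(idx) : c);
        }
        return out.toString();
    }
}
```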

I found a solution that also handles 'ł':

// Requires: using System.Globalization; using System.Text;
string RemoveDiacritics(string text)
{
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();

    foreach (var c in normalizedString)
    {
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    // 'ł' and 'Ł' have no decomposition, so FormD leaves them intact;
    // replace them explicitly.
    return stringBuilder.ToString()
        .Normalize(NormalizationForm.FormC)
        .Replace('ł', 'l')
        .Replace('Ł', 'L');
}
