C# char/byte encoding equality

I have some code that dumps strings to stdout so I can check their encoding. It looks like this:

    private void DumpString(string s)
    {   
        System.Console.Write("{0}: ", s);
        foreach (byte b in s)
        {   
            System.Console.Write("{0}({1}) ", (char)b, b.ToString("x2"));
        }       
        System.Console.WriteLine();
    }

Consider two strings, each of which appears as "ë", but with different encodings. DumpString produces the following output:

ë: e(65)(08)

ë: ë(eb)

The code looks like this:

    DumpString(string1);
    DumpString(string2);

How can I convert string2, using System.Text.Encoding, to be byte-equivalent to string1?


They don't have different encodings. Strings in C# are always UTF-16, so you shouldn't iterate over a string as bytes: casting each 16-bit char to byte discards the top 8 bits, which is why U+0308 shows up as 08 in your dump. What the two strings have is different normalization forms.

Your first string is "\u0065\u0308": LATIN SMALL LETTER E + COMBINING DIAERESIS. This is the decomposed form (NFD).

The second is "\u00EB": LATIN SMALL LETTER E WITH DIAERESIS. This is the precomposed form (NFC).

You can convert between them with string.Normalize.
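As a minimal sketch of that conversion (the class name NormalizationDemo and the helper DumpCodeUnits are made up for illustration, and string1/string2 are assumed to hold exactly the two values described above), something like the following should work. The helper iterates over char rather than byte, so each full 16-bit code unit is shown:

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: formats each UTF-16 code unit as four hex digits,
    // separated by spaces, so no bits are lost as they are with a byte cast.
    public static string DumpCodeUnits(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(((int)c).ToString("x4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Assumed contents of the two strings from the question.
        string string1 = "\u0065\u0308"; // decomposed: e + combining diaeresis
        string string2 = "\u00EB";       // precomposed: e with diaeresis

        Console.WriteLine(DumpCodeUnits(string1)); // 0065 0308
        Console.WriteLine(DumpCodeUnits(string2)); // 00eb

        // Decompose string2 so it matches string1 code unit for code unit.
        string decomposed = string2.Normalize(NormalizationForm.FormD);
        Console.WriteLine(DumpCodeUnits(decomposed)); // 0065 0308
        Console.WriteLine(string.Equals(decomposed, string1, StringComparison.Ordinal)); // True

        // The opposite direction: compose string1 to match string2.
        string composed = string1.Normalize(NormalizationForm.FormC);
        Console.WriteLine(string.Equals(composed, string2, StringComparison.Ordinal)); // True
    }
}
```

Calling Normalize() with no argument uses form C, so going from string2 to string1 needs the explicit NormalizationForm.FormD argument. System.Text.Encoding is not involved at any point, since both strings are already UTF-16.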


You're looking for the String.Normalize method.
