
How do I ignore the UTF-8 Byte Order Marker in String comparisons?

I'm having a problem comparing strings in a unit test in C# 4.0 using Visual Studio 2010. The same test case works properly in Visual Studio 2008 (with .NET 3.5).

Here's the relevant code snippet:

byte[] rawData = GetData();
string data = Encoding.UTF8.GetString(rawData);

Assert.AreEqual("Constant", data, false, CultureInfo.InvariantCulture);

While debugging this test, the data string appears to the naked eye to contain exactly the same string as the literal. When I called data.ToCharArray(), I noticed that the first character of the string data has the value 65279 (U+FEFF), which is the Byte Order Marker; in UTF-8 it is encoded as the three bytes EF BB BF. What I don't understand is why Encoding.UTF8.GetString() keeps this character around.

How do I get Encoding.UTF8.GetString() to not put the Byte Order Marker in the resulting string?

Update: The problem was that GetData(), which reads a file from disk, reads the data from the file using FileStream.readbytes(). I corrected this by using a StreamReader and converting the string to bytes using Encoding.UTF8.GetBytes(), which is what it should've been doing in the first place! Thanks for all the help.


Well, I assume it's because the raw binary data includes the BOM. You could always remove the BOM yourself after decoding, if you don't want it - but you should consider whether the byte array should contain the BOM to start with.
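As a rough illustration of removing the BOM yourself after decoding, here is a minimal sketch. BomHelper and StripLeadingBom are hypothetical names made up for this example (they are not framework APIs), and the byte array is sample data standing in for whatever GetData() returns:

```csharp
using System;
using System.Text;

static class BomHelper
{
    // Hypothetical helper, not a framework API:
    // drops a single leading U+FEFF if the string starts with one.
    public static string StripLeadingBom(string s)
    {
        if (!string.IsNullOrEmpty(s) && s[0] == '\uFEFF')
        {
            return s.Substring(1);
        }
        return s;
    }
}

class StripDemo
{
    static void Main()
    {
        // Sample data: UTF-8 BOM (EF BB BF) followed by "A".
        byte[] rawData = { 0xEF, 0xBB, 0xBF, 0x41 };
        string data = BomHelper.StripLeadingBom(Encoding.UTF8.GetString(rawData));
        Console.WriteLine(data.Length); // prints 1
    }
}
```

Only the first character is checked, so a U+FEFF appearing later in the text is left alone.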

EDIT: Alternatively, you could use a StreamReader to perform the decoding. Here's an example, showing the same byte array being converted into two characters using Encoding.GetString or one character via a StreamReader:

using System;
using System.IO;
using System.Text;

class Test
{
    static void Main()
    {
        // UTF-8 BOM (EF BB BF) followed by "A".
        byte[] withBom = { 0xef, 0xbb, 0xbf, 0x41 };

        // Encoding.GetString decodes the BOM as a U+FEFF character.
        string viaEncoding = Encoding.UTF8.GetString(withBom);
        Console.WriteLine(viaEncoding.Length); // 2

        // StreamReader detects the BOM and skips it.
        string viaStreamReader;
        using (StreamReader reader = new StreamReader
               (new MemoryStream(withBom), Encoding.UTF8))
        {
            viaStreamReader = reader.ReadToEnd();
        }
        Console.WriteLine(viaStreamReader.Length); // 1
    }
}


There is a slightly more efficient way to do it than creating StreamReader and MemoryStream:

1) If you know that there is always a BOM

string viaEncoding = Encoding.UTF8.GetString(withBom, 3, withBom.Length - 3);

2) If you don't know, check:

string viaEncoding;
if (withBom.Length >= 3 && withBom[0] == 0xEF && withBom[1] == 0xBB && withBom[2] == 0xBF)
    viaEncoding = Encoding.UTF8.GetString(withBom, 3, withBom.Length - 3);
else
    viaEncoding = Encoding.UTF8.GetString(withBom);
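The same check can be written without hard-coding the three bytes by asking the encoding for its preamble through Encoding.GetPreamble(). The sketch below is only one way to arrange it; PreambleAwareDecoder and Decode are hypothetical names invented for this example:

```csharp
using System;
using System.Text;

static class PreambleAwareDecoder
{
    // Hypothetical helper: decodes the bytes, skipping the encoding's
    // preamble (BOM) when the data starts with it.
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        int offset = StartsWith(bytes, preamble) ? preamble.Length : 0;
        return encoding.GetString(bytes, offset, bytes.Length - offset);
    }

    private static bool StartsWith(byte[] bytes, byte[] prefix)
    {
        if (prefix.Length == 0 || bytes.Length < prefix.Length)
        {
            return false;
        }
        for (int i = 0; i < prefix.Length; i++)
        {
            if (bytes[i] != prefix[i])
            {
                return false;
            }
        }
        return true;
    }
}

class PreambleDemo
{
    static void Main()
    {
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x41 };
        byte[] withoutBom = { 0x41 };
        Console.WriteLine(PreambleAwareDecoder.Decode(withBom, Encoding.UTF8).Length);    // prints 1
        Console.WriteLine(PreambleAwareDecoder.Decode(withoutBom, Encoding.UTF8).Length); // prints 1
    }
}
```

Note that this relies on the encoding instance reporting a preamble: Encoding.UTF8 does, whereas new UTF8Encoding(false) returns an empty preamble, in which case nothing is skipped.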


Unfortunately the BOM won't be removed with a simple Trim(). But it can be done as follows:

byte[] withBom = { 0xef, 0xbb, 0xbf, 0x41 };    
byte[] bom = { 0xef, 0xbb, 0xbf };
var text = System.Text.Encoding.UTF8.GetString(withBom);

Console.WriteLine($"Untrimmed: {text.Length}, {text}");
var trimmed = text.Trim(System.Text.Encoding.UTF8.GetString(bom).ToCharArray());
Console.WriteLine($"Trimmed: {trimmed.Length}, {trimmed}");

Output:

Untrimmed: 2, A
Trimmed: 1, A


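I believe the extra character is removed if you Trim() the decoded string. Note that this only holds on .NET Framework 3.5 SP1 and earlier; from .NET Framework 4 onward, Trim() no longer treats U+FEFF as white space.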
