How do I ignore the UTF-8 Byte Order Marker in String comparisons?
I'm having a problem comparing strings in a unit test in C# 4.0 using Visual Studio 2010. The same test passes in Visual Studio 2008 (targeting .NET 3.5).
Here's the relevant code snippet:
byte[] rawData = GetData();
string data = Encoding.UTF8.GetString(rawData);
Assert.AreEqual("Constant", data, false, CultureInfo.InvariantCulture);
While debugging this test, the data string appears to the naked eye to contain exactly the same text as the literal. But when I called data.ToCharArray(), I noticed that the first character of data is the value 65279 (U+FEFF), which is the UTF-8 Byte Order Mark. What I don't understand is why Encoding.UTF8.GetString() keeps this character around.
How do I get Encoding.UTF8.GetString() to not put the Byte Order Mark in the resulting string?
Update: The problem was that GetData(), which reads a file from disk, was returning the raw bytes of the file read via FileStream.readbytes(). I corrected this by reading the file with a StreamReader and converting the resulting string to bytes using Encoding.UTF8.GetBytes(), which is what it should've been doing in the first place! Thanks for all the help.
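Roughly speaking, the corrected GetData() now looks something like the sketch below; the file name is just a placeholder for the real path used by the test.
static byte[] GetData()
{
    // StreamReader detects and skips the UTF-8 BOM while decoding the file.
    using (StreamReader reader = new StreamReader("data.txt", Encoding.UTF8))
    {
        string text = reader.ReadToEnd();
        // Encoding.UTF8.GetBytes never emits a BOM, so the returned bytes are BOM-free.
        return Encoding.UTF8.GetBytes(text);
    }
}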
Well, I assume it's because the raw binary data includes the BOM. You could always remove the BOM yourself after decoding, if you don't want it - but you should consider whether the byte array should contain the BOM to start with.
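For illustration, stripping it after decoding could look like this; since the UTF-8 BOM decodes to the single character U+FEFF, it just needs to be removed from the front of the string (variable names follow the question's snippet):
string data = Encoding.UTF8.GetString(rawData);
// If the raw bytes started with the UTF-8 BOM, the decoded string starts with U+FEFF.
if (data.Length > 0 && data[0] == '\uFEFF')
{
    data = data.Substring(1);
}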
EDIT: Alternatively, you could use a StreamReader to perform the decoding. Here's an example showing the same byte array being converted into two characters using Encoding.GetString, or one character via a StreamReader:
using System;
using System.IO;
using System.Text;

class Test
{
    static void Main()
    {
        // UTF-8 BOM (EF BB BF) followed by 'A'
        byte[] withBom = { 0xef, 0xbb, 0xbf, 0x41 };

        // Encoding.UTF8.GetString keeps the BOM as U+FEFF, so this string has 2 characters.
        string viaEncoding = Encoding.UTF8.GetString(withBom);
        Console.WriteLine(viaEncoding.Length);

        // StreamReader consumes the BOM while detecting the encoding, so this string has 1 character.
        string viaStreamReader;
        using (StreamReader reader = new StreamReader(
                   new MemoryStream(withBom), Encoding.UTF8))
        {
            viaStreamReader = reader.ReadToEnd();
        }
        Console.WriteLine(viaStreamReader.Length);
    }
}
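When run, this prints 2 followed by 1: the Encoding.UTF8.GetString result still begins with the invisible U+FEFF character, while the StreamReader result does not.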
There is a slightly more efficient way to do it than creating a StreamReader and a MemoryStream:
1) If you know that there is always a BOM:
string viaEncoding = Encoding.UTF8.GetString(withBom, 3, withBom.Length - 3);
2) If you don't know, check:
string viaEncoding;
if (withBom.Length >= 3 && withBom[0] == 0xEF && withBom[1] == 0xBB && withBom[2] == 0xBF)
viaEncoding = Encoding.UTF8.GetString(withBom, 3, withBom.Length - 3);
else
viaEncoding = Encoding.UTF8.GetString(withBom);
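A variation on the same check (not from the original answer) is to take the BOM bytes from Encoding.UTF8.GetPreamble(), which returns { 0xEF, 0xBB, 0xBF }, instead of hard-coding them:
byte[] preamble = Encoding.UTF8.GetPreamble();   // { 0xEF, 0xBB, 0xBF }
bool hasBom = withBom.Length >= preamble.Length
              && withBom[0] == preamble[0]
              && withBom[1] == preamble[1]
              && withBom[2] == preamble[2];
string decoded = hasBom
    ? Encoding.UTF8.GetString(withBom, preamble.Length, withBom.Length - preamble.Length)
    : Encoding.UTF8.GetString(withBom);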
Unfortunately the BOM won't be removed with a simple Trim(). But it can be done as follows:
byte[] withBom = { 0xef, 0xbb, 0xbf, 0x41 };
byte[] bom = { 0xef, 0xbb, 0xbf };
var text = System.Text.Encoding.UTF8.GetString(withBom);
Console.WriteLine($"Untrimmed: {text.Length}, {text}");
var trimmed = text.Trim(System.Text.Encoding.UTF8.GetString(bom).ToCharArray());
Console.WriteLine($"Trimmed: {trimmed.Length}, {trimmed}");
Output:
Untrimmed: 2, A
Trimmed: 1, A
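Since the BOM decodes to the single character U+FEFF, the same idea can also be written more directly (a variant, not from the answer above):
var trimmed = text.TrimStart('\uFEFF');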
I believe the extra character is removed if you Trim() the decoded string.