UTF8 Beginning of File characters are breaking serializer & readers
Okay, I'm trying to work with UTF8 text files. I'm constantly fighting the BOM chars that the writer drops in for UTF8, which blows up pretty much anything I need to use to read the file including serializers and other text readers.
I'm getting a leading six bytes of data:
0xEF
0xBB
0xBF
0xEF
0xBB
0xBF
(now that I'm looking at it, I realize there's two characters there. Is that the UTF8 BOM marker? Am I double encoding it)?
Notice the se开发者_StackOverflow中文版rializer encodes to UTF8, then the memory stream gets a string as UTF8, then I write the string to the file with UTF8... seems like a lot of redundancy. Thoughts?
//I'm storing this xml result to a database field. (this one includes the BOF chars)
using (MemoryStream ms = new MemoryStream())
{
Utility.SerializeXml(ms, root);
xml = Encoding.UTF8.GetString(ms.ToArray());
}
//later on, I would take that xml and then write it out to a file like this:
File.WriteAllText(path, xml, Encoding.UTF8);
public static void SerializeXml(Stream output, object data)
{
XmlSerializer xs = new XmlSerializer(data.GetType());
XmlWriterSettings settings = new XmlWriterSettings();
settings.Indent = true;
settings.IndentChars = "\t";
settings.Encoding = Encoding.UTF8;
XmlWriter writer = XmlTextWriter.Create(output, settings);
xs.Serialize(writer, data);
writer.Flush();
writer.Close();
}
Yeah, that's two BOMs. You're encoding to UTF-8 twice and each time adds a pseudo-BOM, due to the extremely unfortunate fact that:
Encoding.UTF8
means “UTF-8 with a pointless, meaningless U+FEFF stuck to the front to screw up your applications”. Try instead using
new UTF8Encoding(false)
which should give you a less sucky version.
Yes that is a BOM.
Yes some older JDK's had a bug that blew up on UTF-8 BOM data. And two of them will confuse even a modern version of Java.
The solution I used was to stick a pushback stream on the front and filter it out.
Or use a more modern version of Java.
The byte sequence 0xEF 0xBB 0xBF is the UTF-8 encoding of U+FEFF, which is the Unicode BOM (byte order mark). It is unnecessary in UTF-8, but crucial in UTF-16 or UTF-32.
You've got the same sequence twice.
The only good thing to do with them is ignore and/or delete them.
精彩评论