UTF8 Beginning of File characters are breaking serializer & readers

2022-12-11 21:46

Okay, I'm trying to work with UTF8 text files. I'm constantly fighting the BOM chars that the writer drops in for UTF8, which blows up pretty much anything I need to use to read the file including serializers and other text readers.

I'm getting a leading six bytes of data:

0xEF
0xBB
0xBF
0xEF
0xBB
0xBF

(Now that I'm looking at it, I realize there are two characters there. Is that the UTF8 BOM? Am I double-encoding it?)

Notice the serializer encodes to UTF8, then the memory stream gets a string as UTF8, then I write the string to the file with UTF8... seems like a lot of redundancy. Thoughts?

//I'm storing this xml result to a database field. (this one includes the BOF chars)
using (MemoryStream ms = new MemoryStream())
{
    Utility.SerializeXml(ms, root);
    xml = Encoding.UTF8.GetString(ms.ToArray());

}


//later on, I would take that xml and then write it out to a file like this: 
File.WriteAllText(path, xml, Encoding.UTF8);



public static void SerializeXml(Stream output, object data)
{
    XmlSerializer xs = new XmlSerializer(data.GetType());
    XmlWriterSettings settings = new XmlWriterSettings();
    settings.Indent = true;
    settings.IndentChars = "\t";
    settings.Encoding = Encoding.UTF8;
    XmlWriter writer = XmlTextWriter.Create(output, settings);
    xs.Serialize(writer, data);
    writer.Flush();
    writer.Close();
}

Yeah, that's two BOMs. You're encoding to UTF-8 twice and each time adds a pseudo-BOM, due to the extremely unfortunate fact that:

Encoding.UTF8

means “UTF-8 with a pointless, meaningless U+FEFF stuck to the front to screw up your applications”. Try instead using

new UTF8Encoding(false)

which should give you a less sucky version.
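For what it's worth, here is a rough sketch of how the question's code might look with that change applied in both places (the serializer settings and the file write). The class and member names (`BomFreeXml`, `SerializeToString`, `WriteFile`, `Utf8NoBom`) are made up for illustration and are not from the original post.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical helper, not part of the original post.
public static class BomFreeXml
{
    // UTF-8 encoding that does not emit the EF BB BF preamble.
    private static readonly Encoding Utf8NoBom = new UTF8Encoding(false);

    // Serializes data to an XML string with no leading U+FEFF.
    public static string SerializeToString(object data)
    {
        var settings = new XmlWriterSettings
        {
            Indent = true,
            IndentChars = "\t",
            Encoding = Utf8NoBom   // the writer takes its BOM behavior from this encoding
        };

        using (var ms = new MemoryStream())
        {
            using (XmlWriter writer = XmlWriter.Create(ms, settings))
            {
                new XmlSerializer(data.GetType()).Serialize(writer, data);
            }
            return Utf8NoBom.GetString(ms.ToArray());
        }
    }

    // Writes the string as UTF-8 without prepending a BOM.
    public static void WriteFile(string path, string xml)
    {
        File.WriteAllText(path, xml, Utf8NoBom);
    }
}
```

With this, neither the database field nor the file should start with 0xEF 0xBB 0xBF, while the XML declaration still says `encoding="utf-8"`.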

Yes, that is a BOM.

Yes, some older JDKs had a bug that blew up on UTF-8 BOM data. And two BOMs in a row will confuse even a modern version of Java.

The solution I used was to stick a pushback stream on the front and filter it out.

Or use a more modern version of Java.

The byte sequence 0xEF 0xBB 0xBF is the UTF-8 encoding of U+FEFF, which is the Unicode BOM (byte order mark). It is unnecessary in UTF-8, but crucial in UTF-16 or UTF-32.

You've got the same sequence twice.

The only good thing to do with them is ignore and/or delete them.
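If the BOMs are already baked into stored data (say, the database field), deleting them on read could look something like the sketch below. It assumes the content is otherwise valid UTF-8; `BomStripper` and `StripLeadingBoms` are hypothetical names chosen for this example.

```csharp
using System;

// Hypothetical helper, not part of the original post.
public static class BomStripper
{
    // Removes any number of leading U+FEFF characters from a decoded string.
    public static string StripLeadingBoms(string text)
    {
        return text.TrimStart('\uFEFF');
    }

    // Removes any number of leading EF BB BF sequences from raw bytes.
    public static byte[] StripLeadingBoms(byte[] bytes)
    {
        int offset = 0;
        while (bytes.Length - offset >= 3
               && bytes[offset] == 0xEF
               && bytes[offset + 1] == 0xBB
               && bytes[offset + 2] == 0xBF)
        {
            offset += 3;
        }

        var result = new byte[bytes.Length - offset];
        Buffer.BlockCopy(bytes, offset, result, 0, result.Length);
        return result;
    }
}
```

Both overloads loop until no BOM is left at the front, so the doubled case from the question is handled as well as the single one.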
