Serializing an object to a string: why is my encoding adding stupid characters?

I need to get the serialized XML representation of an object as a string. I'm using an XmlSerializer and a MemoryStream to do this.

XmlSerializer serializer = new XmlSerializer(typeof(MyClass));
using (MemoryStream stream = new MemoryStream())
{
  using (XmlTextWriter writer = new XmlTextWriter(stream,Encoding.UTF8))
  {
    serializer.Serialize(writer, myClass);
    string xml = Encoding.UTF8.GetString(stream.ToArray());
    //other chars may be added from the encoding.
    xml = xml.Substring(xml.IndexOf(Convert.ToChar(60)));
    xml = xml.Substring(0, (xml.LastIndexOf(Convert.ToChar(62)) + 1));
    return xml;
  }
}

Now take note of the xml.Substring lines for a moment. What I'm finding is that, even though I'm specifying the encoding on both the XmlTextWriter and the GetString call (and I'm using MemoryStream.ToArray(), so I'm operating only on the data actually written to the stream's buffer), the resulting XML string has a non-XML-happy character added. In my case, a '?' at the start of the string. This is why I'm substring-ing on '<' and '>': to ensure I'm only getting the good stuff.

Strange thing is, looking at this string in the debugger (Text Visualizer), I don't see this '?'. Only when I paste what's in the visualizer into notepad or similar.

So while the above code (substring etc) does the job, what's actually happening here? Is some unsigned byte thing being included and not being represented in the Text Visualizer?


You can exclude the BOM by constructing the encoding explicitly, i.e. instead of Encoding.UTF8, try using:

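using (MemoryStream stream = new MemoryStream())
{
  // false = do not emit a byte order mark
  var enc = new UTF8Encoding(false);
  using (XmlTextWriter writer = new XmlTextWriter(stream, enc))
  {
    serializer.Serialize(writer, myClass);
  }
  // Disposing the writer also closes the stream, so stream.Length would throw
  // here; ToArray() still works on a closed MemoryStream.
  string xml = enc.GetString(stream.ToArray());
}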


What you are looking at is a byte order mark (BOM). It is optional in UTF-8, but .NET's Encoding.UTF8 writes one at the start of its output by default.

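In short, for my comment fans: a BOM is a marker at the start of a byte stream. In UTF-16 and UTF-32 it indicates the endianness; in UTF-8, which has no byte-order ambiguity, it only signals that the data is UTF-8.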
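To make the marker visible, here is a minimal sketch that dumps the preamble bytes an encoding writes and shows that GetString keeps a leading BOM rather than stripping it. The class and method names (BomDemo, PreambleHex, DecodeWithBom) are made up for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class BomDemo
{
    // Hex dump of the preamble (BOM) an encoding writes at the start of a stream.
    public static string PreambleHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    // Decodes bytes that begin with the UTF-8 BOM (EF BB BF) followed by "<a/>".
    // GetString does not strip the BOM; it comes back as the character U+FEFF.
    public static string DecodeWithBom()
    {
        byte[] bytes = { 0xEF, 0xBB, 0xBF, (byte)'<', (byte)'a', (byte)'/', (byte)'>' };
        return Encoding.UTF8.GetString(bytes);
    }
}
```

BomDemo.PreambleHex(Encoding.UTF8) gives "EF-BB-BF", while passing new UTF8Encoding(false) gives an empty string. The first character of DecodeWithBom() is U+FEFF, a zero-width character, which is most likely why the Text Visualizer shows nothing while a paste into Notepad shows a '?'.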

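What you can do is either a) use ASCII as your encoding, which has no byte order mark (at the cost of not being able to represent non-ASCII characters directly), or b) leave it in. A BOM does serve a useful function when the bytes end up in a file or stream, where it identifies the encoding, although it has no purpose inside an in-memory string.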

Marc Gravell's answer gives a third alternative: create your own encoding object and pass false to the constructor to suppress the byte order mark.
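The BOM-free encoding approach could be wrapped in a small helper along these lines. This is only a sketch: XmlStringHelper and ToXmlString are hypothetical names, and the MyClass shown here is a stand-in for the question's type, whose members are not given.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Hypothetical stand-in for the question's MyClass.
public class MyClass
{
    public string Name { get; set; }
}

// Hypothetical helper, for illustration only.
public static class XmlStringHelper
{
    public static string ToXmlString<T>(T value)
    {
        var serializer = new XmlSerializer(typeof(T));
        var encoding = new UTF8Encoding(false); // false: do not emit a BOM

        using (var stream = new MemoryStream())
        {
            using (var writer = new XmlTextWriter(stream, encoding))
            {
                serializer.Serialize(writer, value);
            }
            // The writer has been flushed and disposed, which closes the stream;
            // ToArray() is still valid on a closed MemoryStream.
            return encoding.GetString(stream.ToArray());
        }
    }
}
```

With this, XmlStringHelper.ToXmlString(new MyClass { Name = "x" }) should start directly with the '<' of the XML declaration, so the Substring trimming from the question is no longer needed.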
