C#: shield XmlTextReader from an occasional Unicode character
In C#, I have an XmlTextReader created directly from an HTTP response (I have no control over the XML content of the response).
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
XmlTextReader reader = new XmlTextReader(response.GetResponseStream());
It works, but sometimes one of the XML element nodes will contain a Unicode character (e.g. "é") which trips the reader. I've tried to use a StreamReader with declared encoding, but now the XmlTextReader quits out on the very first line: "Data invalid. Line 1, position 1":
StreamReader sReader = new StreamReader(response.GetResponseStream(), System.Text.Encoding.Unicode);
XmlTextReader reader = new XmlTextReader(sReader);
Is there a way to fix this? Alternatively, is there a way to prevent the XmlTextReader from parsing an element (I know its name) with a potentially offending character? I don't care about that particular element, I just don't want it to trip the reader.
EDIT: Quick fix: read the response into a StringBuilder ("sb"):
sb.Replace("é", "e");
StringReader strReader = new StringReader(sb.ToString());
XmlTextReader reader = new XmlTextReader(strReader);
It is not a Unicode character; it is an invalid character (one that is not correctly encoded).
There is no way to shield an XmlTextReader from invalid XML. You need to either
- Fix the server side to properly encode characters
- Pre-process the text to do it yourself
In UTF-8, every non-ASCII character is encoded with 2 or more bytes; "é" itself takes 2 (0xC3 0xA9). You can use a hex editor to verify what the server actually sends.
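If a hex editor is not handy, the same check can be sketched in code. The example below is only an illustration: the class and method names (EncodingCheck, IsValidUtf8) are made up for this post, and ISO-8859-1 is used purely as an example of a single-byte encoding the server might be sending by mistake. It prints the bytes for "é" in both encodings and uses a strict UTF8Encoding to test whether a byte sequence is well-formed UTF-8.

```csharp
using System;
using System.Text;

public static class EncodingCheck
{
    // Hypothetical helper: returns true if the bytes decode as well-formed UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes decoding throw instead of
        // silently substituting a replacement character.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // "\u00E9" is "é"; the escape avoids depending on the source file's encoding.
        byte[] utf8 = Encoding.UTF8.GetBytes("\u00E9");
        byte[] latin1 = Encoding.GetEncoding("ISO-8859-1").GetBytes("\u00E9");

        Console.WriteLine(BitConverter.ToString(utf8));   // C3-A9
        Console.WriteLine(BitConverter.ToString(latin1)); // E9

        Console.WriteLine(IsValidUtf8(utf8));   // True
        Console.WriteLine(IsValidUtf8(latin1)); // False
    }
}
```

A lone 0xE9 byte in a response that claims to be UTF-8 is the kind of input that would make the reader fail.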
What do you mean by "trips the reader"? Your first snippet of code should be fine - if the XML is genuinely in the encoding it declares (please look at the XML declaration) then it should be absolutely fine.
If the XML is genuinely broken, I would suggest performing some sort of filtering before XML parsing (e.g. loading the XML into a string with the right encoding, then fixing the declared encoding to match)... but we'll need to work out what's wrong with it first.
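As a rough sketch of that kind of pre-processing: decode the response bytes with the encoding they are really in, then hand the resulting text to the XmlTextReader. Everything specific here is an assumption for illustration, not something known about the server in question: the helper name ReadElementText, the sample document, and the choice of ISO-8859-1 as the real encoding. Note that Encoding.Unicode in .NET means UTF-16 little-endian, which would explain the "Line 1, position 1" failure if the response is actually single-byte or UTF-8 text.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlDecodeSketch
{
    // Hypothetical helper: decodes the stream with the given encoding and
    // returns the text content of the first element with the given name.
    public static string ReadElementText(Stream input, Encoding actualEncoding, string elementName)
    {
        // The StreamReader does the byte-to-character decoding, so the
        // XmlTextReader only ever sees already-decoded text.
        using (var text = new StreamReader(input, actualEncoding))
        using (var reader = new XmlTextReader(text))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    return reader.ReadElementContentAsString();
                }
            }
        }
        return null; // element not found
    }

    public static void Main()
    {
        // Assumption: the server sends ISO-8859-1 bytes without declaring it.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        byte[] body = latin1.GetBytes("<r><name>caf\u00E9</name></r>");

        string name = ReadElementText(new MemoryStream(body), latin1, "name");
        Console.WriteLine(name == "caf\u00E9"); // True
    }
}
```

With a real response, the MemoryStream would be replaced by response.GetResponseStream(), and the encoding would be whatever inspection of the raw bytes shows it to be.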