C# XmlWriter and invalid UTF8 characters

We created a unit test that uses the following methods to generate random UTF8 text:

        private static Random _rand = new Random(Environment.TickCount);

        public static byte CreateByte()
        {
            return (byte)_rand.Next(byte.MinValue, byte.MaxValue + 1);
        }

        public static byte[] CreateByteArray(int length)
        {
            return Repeat(CreateByte, length).ToArray();
        }

        public static string CreateUtf8String(int length)
        {
            return Encoding.UTF8.GetString(CreateByteArray(length));
        }

        private static IEnumerable<T> Repeat<T>(Func<T> func, int count)
        {
            for (int i = 0; i < count; i++)
            {
                yield return func();
            }
        }

When we pass these random UTF8 strings to our business logic, XmlWriter writes the generated string and sometimes fails with the error:

Test method UnitTest.Utf8 threw exception: 
System.ArgumentException: ' ', hexadecimal value 0x0E, is an invalid character.

System.Xml.XmlUtf8RawTextWriter.InvalidXmlChar(Int32 ch, Byte* pDst, Boolean entitize)
System.Xml.XmlUtf8RawTextWriter.WriteAttributeTextBlock(Char* pSrc, Char* pSrcEnd)
System.Xml.XmlUtf8RawTextWriter.WriteString(String text)
System.Xml.XmlUtf8RawTextWriterIndent.WriteString(String text)
System.Xml.XmlWellFormedWriter.WriteString(String text)
System.Xml.XmlWriter.WriteAttributeString(String localName, String value)

We want to support any possible string to be passed in, and need these invalid characters escaped somehow.

XmlWriter already escapes things like &, <, and >. How can we deal with other invalid characters, such as control characters?

PS - let me know if our UTF8 generator is flawed (I'm already seeing where I shouldn't let it generate '\0')


The XmlConvert Class has a lot of useful methods (like EncodeName, IsXmlChar, ...) for making sure you're building valid Xml.
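
As a minimal sketch of how XmlConvert.IsXmlChar might be used here, the helper below drops every character that XML 1.0 does not allow before the string reaches XmlWriter. The class and method names (XmlTextSanitizer, RemoveInvalidXmlChars) are hypothetical, and removing characters is only one possible policy; it assumes that losing those characters is acceptable for your data.

```csharp
using System;
using System.Text;
using System.Xml;

public static class XmlTextSanitizer
{
    // Hypothetical helper: returns a copy of text without characters that XML 1.0 forbids.
    public static string RemoveInvalidXmlChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                // A well-formed surrogate pair encodes U+10000..U+10FFFF, which XML allows.
                sb.Append(c).Append(text[i + 1]);
                i++; // skip the low surrogate that was just copied
            }
            else if (!char.IsSurrogate(c) && XmlConvert.IsXmlChar(c))
            {
                // Ordinary character that XML 1.0 accepts (TAB, LF, CR, U+0020 and above).
                sb.Append(c);
            }
            // Anything else (for example U+000E or a lone surrogate) is dropped.
        }
        return sb.ToString();
    }
}
```

The sanitized result can then be passed to WriteAttributeString or WriteString. Note that the original string can no longer be recovered from the XML once characters have been dropped.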


There are two problems:

  1. Not all characters are valid in XML, even when escaped. In XML 1.0, the only characters with a Unicode code point below 0x0020 that are valid are TAB (&#9;), LF (&#10;), and CR (&#13;). See XML 1.0, Section 2.2, Characters.

    In XML 1.1, which relatively few systems support, any control character except NUL can be written as a numeric character reference.

  2. Not all sequences of bytes are valid UTF-8. For example, according to the specification (RFC 3629), "The octet values C0, C1, F5 to FF never appear." You would probably be better off just creating Strings of characters and ignoring UTF-8, or creating the String, converting it to UTF-8 and back if you're really into encoding.
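
To see the second problem in isolation, here is a small sketch that checks whether a byte array is well-formed UTF-8 by decoding it with a strict UTF8Encoding. The names Utf8Validation and IsValidUtf8 are made up for illustration. Encoding.UTF8 does not throw on bad input and substitutes U+FFFD instead, which is why the generator in the question never fails at the decoding step.

```csharp
using System;
using System.Text;

public static class Utf8Validation
{
    // Hypothetical helper: true only if the bytes form well-formed UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes decoding throw instead of substituting U+FFFD.
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

The two problems are independent: the single byte 0x0E passes this check because it is valid UTF-8, yet the character it decodes to is still not allowed in XML 1.0.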


Your UTF-8 generator appears to be flawed. There are many byte sequences which are invalid UTF-8 encodings.

A better way to generate valid random UTF-8 encodings is to generate random characters, put them into a string and then encode the string to UTF-8.
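
A rough sketch of that approach, assuming you want uniformly random Unicode scalar values: pick random code points, skip the surrogate range, build a string, and only then encode it. RandomText, CreateRandomString, and CreateRandomUtf8 are illustrative names, not part of any library.

```csharp
using System;
using System.Text;

public static class RandomText
{
    private static readonly Random Rand = new Random();

    // Hypothetical helper: builds a string of `length` random Unicode scalar values.
    public static string CreateRandomString(int length)
    {
        var sb = new StringBuilder(length);
        for (int i = 0; i < length; i++)
        {
            int codePoint;
            do
            {
                // U+0001..U+10FFFF; the upper bound of Next is exclusive and NUL is left out.
                codePoint = Rand.Next(0x0001, 0x110000);
            } while (codePoint >= 0xD800 && codePoint <= 0xDFFF); // surrogates are not scalar values

            // Appends one char, or a surrogate pair for code points above U+FFFF.
            sb.Append(char.ConvertFromUtf32(codePoint));
        }
        return sb.ToString();
    }

    // Encoding a well-formed string always yields valid UTF-8.
    public static byte[] CreateRandomUtf8(int length)
    {
        return Encoding.UTF8.GetBytes(CreateRandomString(length));
    }
}
```

Strings produced this way are valid Unicode, but they can still contain control characters that XML 1.0 rejects, so XmlWriter may still throw for them.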


Mark points out that not every byte sequence is a valid UTF-8 sequence.

I'd like to add that not every character can exist in an XML document. Only some characters are valid, and this is true even if they are encoded as a numeric character reference.

Update: If you want to encode arbitrary binary data in XML, then use Base64 or some other encoding before writing them to XML.
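
A minimal sketch of the Base64 suggestion, assuming the payload is written as an attribute value. The element name "item", the attribute name "data", and the class Base64XmlExample are placeholders chosen for this example.

```csharp
using System;
using System.Text;
using System.Xml;

public static class Base64XmlExample
{
    // Hypothetical helper: writes arbitrary bytes into an attribute as Base64 text.
    public static string WriteBinaryAttribute(byte[] payload)
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("item");
            // Base64 output uses only A-Z, a-z, 0-9, '+', '/' and '=', all of which are valid in XML.
            writer.WriteAttributeString("data", Convert.ToBase64String(payload));
            writer.WriteEndElement();
        }
        return sb.ToString();
    }
}
```

The reader reverses the step with Convert.FromBase64String on the attribute value. This keeps every byte intact, at the cost of the value no longer being human-readable in the XML.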
