String from Servlet with control characters in XML CDATA

My question is similar to Why are "control" characters illegal in XML 1.0? However, I'm looking for a solution to the problem below, rather than an explanation of why the XML spec disallows control characters.

I have a servlet which prints a String containing an XML document upon user request. One particular element contains a CDATA section, which is required to hold some user input text.

Now it so happens that in one particular case, our user input contains the character U+0001 (a control character). And even though I specify the charset as UTF-8, the servlet throws an error:

Error: not well-formed
Location: 

<![CDATA[ 

Is there a way I can process the Java String to make it "XML safe" ? Particularly, to make it safe when put in the CDATA section?

I hope my question is clear!

Thanks in advance, Raj


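XML 1.0 does not allow U+0001 anywhere in a document, neither inside a CDATA section nor as a character reference such as &#1;. So the only conforming way to make this XML-safe is to add your own encoding.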

You can do one of these two (for example):

  • Store all data as textual data and replace all forbidden characters with some Unicode-escape mechanism (other than the one defined in XML itself!). For example, you could use the one used by Java: \u0001 for U+0001. Or
  • store the data as binary data and use base64Binary or hexBinary to store your data in XML.

Both of those approaches need explicit support in both the consumer and the producer. The second approach has the advantage of using well-defined data types with wide support, but if your content is actually text, you need to specify (or communicate) the encoding used in the byte stream, something that XML itself would otherwise take care of.
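A minimal sketch of both approaches in Java. The class and method names (XmlTextEncoding, escapeForbidden, toBase64, fromBase64) are made up for illustration, and the choice to escape backslashes as well is an assumption, made so that a consumer can reverse the mapping unambiguously. The Base64 variant assumes producer and consumer agree on UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class XmlTextEncoding {

    // True if the code point matches the XML 1.0 "Char" production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Approach 1: replace forbidden characters with Java-style
    // backslash-u escapes. Backslashes are doubled (an assumption of
    // this sketch) so the consumer can tell escapes from literal text.
    static String escapeForbidden(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '\\') {
                out.append("\\\\");
            } else if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Every forbidden code point is below U+10000,
                // so four hex digits are always enough.
                out.append(String.format("\\u%04X", cp));
            }
        }
        return out.toString();
    }

    // Approach 2: treat the text as bytes (UTF-8 assumed) and store it
    // as base64Binary content.
    static String toBase64(String input) {
        return Base64.getEncoder()
                .encodeToString(input.getBytes(StandardCharsets.UTF_8));
    }

    // Consumer side of approach 2.
    static String fromBase64(String encoded) {
        return new String(Base64.getDecoder().decode(encoded),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String userInput = "before" + (char) 0x1 + "after";
        System.out.println(escapeForbidden(userInput));
        System.out.println(toBase64(userInput));
    }
}
```

Either result can be placed in the element instead of the raw text. The consumer then has to undo the escaping or decode the Base64 value, which is the explicit support mentioned above.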

If removing all non-transferable characters would be appropriate, then this regex should do the trick:

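import java.util.regex.Pattern;
// Matches runs of characters outside the XML 1.0 Char production (including lone surrogates).
Pattern XML_INVALID_CHARS = Pattern.compile("[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]+");
String xmlSafe = XML_INVALID_CHARS.matcher(input).replaceAll("");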

Note that the spec, in a note on the allowed character ranges, encourages document authors to be even more restrictive and to avoid some further control characters and permanently undefined code points. A regex covering those would be a bit longer.
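As a rough sketch of that stricter filtering, here is a code-point loop instead of a longer regex. The names StrictXmlFilter, isDiscouraged and stripStrict are hypothetical, and the discouraged ranges are transcribed from the note in section 2.2 of the XML 1.0 spec as I understand it, so check them against the spec before relying on this.

```java
public class StrictXmlFilter {

    // True if the code point matches the XML 1.0 "Char" production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Allowed but discouraged characters (assumed ranges, see lead-in):
    // U+007F-U+0084, U+0086-U+009F, U+FDD0-U+FDEF, and the last two
    // code points of each supplementary plane (U+1FFFE, U+1FFFF, ...).
    static boolean isDiscouraged(int cp) {
        return (cp >= 0x7F && cp <= 0x84)
                || (cp >= 0x86 && cp <= 0x9F)
                || (cp >= 0xFDD0 && cp <= 0xFDEF)
                || (cp >= 0x10000 && (cp & 0xFFFE) == 0xFFFE);
    }

    // Drops every code point that is forbidden or discouraged.
    static String stripStrict(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
                .filter(cp -> isXmlChar(cp) && !isDiscouraged(cp))
                .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Note that U+0085 (NEL) sits in the gap between the two C1 ranges and is kept by this sketch.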
