Why does XMLEventReader report a CHARACTERS event that contains markup?

I have an XMLEventReader. It has been built from an XMLInputFactory with the "UTF8" encoding. I am using it to read an XML file whose "encoding" attribute is set to "UTF-8".

I have verified that the XML file views correctly under Firefox. When you view the page encoding, it says that it is UTF-8.

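I have asked for character events to be coalesced. The property is set on the XMLInputFactory before the reader is created (XMLEventReader itself has no setProperty method):

factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);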

The XML document does not have a DTD. It is well-formed.

The XMLEventReader will occasionally report that a CHARACTERS event has been received whose content is (minus the quotation marks), for example:

r problems were most severe and frequent.) Did you sleep a lot more than usual nearly every night during that period?</text>  Ð 

Note the presence of the markup tag near the end of the sample, as well as the stray capital eth (Ð). Note also that the sentence has been lopped off; presumably there was another CHARACTERS event before this one that contains the leading part of the sentence.

Why does the XMLEventReader screw up the parsing? Why are the characters not displaying correctly? Why does the XMLEventReader not coalesce CHARACTERS events, if that's what's going on? Why is StAX so unbelievably festeringly ugly and unpredictable?

I am using the XMLEventReader supplied to me by my Java runtime (Java 6) on a Mac.

Here is some sample XML, which of course I've simply copied from my editor, so who knows what character conversions occurred as a result of that, but anyhow:

<question id="BMHPD17">
  <permittedResponseCount>1</permittedResponseCount>
  <text>It’s hard for me to stay out of trouble. (Would you say this is true or false for you?)</text>
  <namedAnswerSet idref="TrueFalse"></namedAnswerSet>
</question>

Note the "smart apostrophe" on line 3.

I am reading this by reacting to a CHARACTERS event and saving its contents to a local String, then reacting to the END_ELEMENT event whose name is "question". On that END_ELEMENT event, I take the saved String and pass it to the constructor of a Java object.
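
A minimal sketch of this kind of handling, for illustration only. It assumes coalescing is requested on the XMLInputFactory, and it appends every CHARACTERS event seen inside <text> to a StringBuilder, so the result does not depend on the parser delivering the text as a single event. The class name QuestionTextReader and the method readFirstText are hypothetical and do not come from the original code.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: returns the text of the first <text> element, or null.
class QuestionTextReader {

    static String readFirstText(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Coalescing is a factory property, set before the reader is created.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLEventReader reader = factory.createXMLEventReader(in, "UTF-8");

        StringBuilder text = new StringBuilder();
        boolean inText = false;
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && "text".equals(event.asStartElement().getName().getLocalPart())) {
                    inText = true;
                    text.setLength(0);
                } else if (inText && event.isCharacters()) {
                    // Append rather than overwrite, in case of several events.
                    text.append(event.asCharacters().getData());
                } else if (event.isEndElement()
                        && "text".equals(event.asEndElement().getName().getLocalPart())) {
                    return text.toString();
                }
            }
        } finally {
            reader.close();
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<question id=\"BMHPD17\">"
                + "<text>It\u2019s hard for me to stay out of trouble.</text>"
                + "</question>";
        InputStream in = new ByteArrayInputStream(xml.getBytes("UTF-8"));
        System.out.println(readFirstText(in));
    }
}
```

Appending instead of keeping only the last CHARACTERS event is a defensive choice: it gives the same result whether or not the parser honours the coalescing request.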

When I System.out.println() the result, I get (sometimes) the bogus junk I referred to earlier.

When I wrap System.out inside a PrintWriter with "UTF8" encoding set, so that I'm not simply outputting characters according to the platform's encoding, I get the same results.


This turns out to be a bug in the Mac OS X JVM. The character encoding used by the console does not default to UTF-8, even though every other use of the default character encoding is UTF-8.
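
One way to check for this kind of mismatch is to print what the JVM reports as its default charset and to write through a PrintStream with an explicit encoding. The sketch below is illustrative only: the class name Utf8ConsoleSketch and the helper encode are hypothetical, and it assumes the terminal itself interprets the bytes it receives as UTF-8, which the code cannot verify.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

class Utf8ConsoleSketch {

    // Hypothetical helper: the bytes a PrintStream emits for s in the given encoding.
    static byte[] encode(String s, String encoding) throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, encoding);
        out.print(s);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // What the JVM believes the platform encoding to be.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        // Write with an explicit encoding instead of relying on the default.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("It\u2019s hard for me to stay out of trouble.");

        // The smart apostrophe U+2019 is three bytes in UTF-8.
        System.out.println("bytes for U+2019: " + encode("\u2019", "UTF-8").length);
    }
}
```

If the reported default charset is not UTF-8, the console path is a more likely culprit than the XML parser.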


Is this even the same as the underlying SAX event, which includes a start offset and length? If so, you will probably find these specify a region of the string that excludes the markup.
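
To illustrate the offset and length point, here is a small sketch of a SAX handler that reads only the region of the buffer that belongs to each characters() call. The class name SaxCharactersSketch and the method collectText are hypothetical. Whether the StAX implementation in question exposes its buffer the same way is an assumption, not something confirmed here.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class SaxCharactersSketch {

    // Hypothetical helper: concatenates all character data in the document.
    static String collectText(byte[] xmlBytes) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Only ch[start] .. ch[start + length - 1] belong to this event.
                // The rest of the array is the parser's buffer and may hold
                // unrelated content, including markup, so do not read all of ch.
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xmlBytes), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<question><text>It\u2019s hard for me.</text></question>";
        System.out.println(collectText(xml.getBytes("UTF-8")));
    }
}
```

Code that does new String(ch) instead of respecting start and length would produce output much like the sample in the question, with trailing markup and stray characters.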
