Why does XMLEventReader report a CHARACTERS event that contains markup?

2023-01-16 02:26

I have an XMLEventReader. It has been built from an XMLInputFactory with the "UTF8" encoding. I am using it to read an XML file whose "encoding" attribute is set to "UTF-8".

I have verified that the XML file views correctly under Firefox. When you view the page encoding, it says that it is UTF-8.

I have asked for character events to be coalesced like this, on the factory that creates the reader:

factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
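For reference, in the StAX API the IS_COALESCING constant is declared on XMLInputFactory, and the property is set on the factory before the reader is created; XMLEventReader itself only offers getProperty. The following is a minimal sketch of that setup, not the asker's actual code: the class name CoalescingSketch, the helper countCharacterEvents and the sample XML are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;

class CoalescingSketch {
    // Counts the CHARACTERS events reported for the given XML text.
    static int countCharacterEvents(String xml, boolean coalesce) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // The property belongs to the factory and is set before the reader exists.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.valueOf(coalesce));
            XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
            int count = 0;
            while (reader.hasNext()) {
                if (reader.nextEvent().isCharacters()) {
                    count++;
                }
            }
            reader.close();
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: an entity reference in the middle of a text node.
        String xml = "<text>true &amp; false</text>";
        System.out.println("coalescing on:  " + countCharacterEvents(xml, true));
        System.out.println("coalescing off: " + countCharacterEvents(xml, false));
    }
}
```

Without coalescing, a parser is free to split one text node into several CHARACTERS events, so the second count may be greater than one depending on the implementation.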

The XML document does not have a DTD. It is valid.

The XMLEventReader will occasionally report that a CHARACTERS event has been received whose content is (minus the quotation marks), for example:

r problems were most severe and frequent.) Did you sleep a lot more than usual nearly every night during that period?</text>  Ð

Note the presence of the markup tag near the end of the sample, as well as the capital eth (Ð). Note also that the sentence has been lopped off; presumably there was another CHARACTERS event before this one that contains the leading part of the sentence.

Why does the XMLEventReader screw up the parsing? Why are the characters not displaying correctly? Why does the XMLEventReader not coalesce CHARACTERS events, if that's what's going on? Why is StAX so unbelievably festeringly ugly and unpredictable?

I am using the XMLEventReader supplied to me by my Java runtime (Java 6) on a Mac.

Here is some sample XML, which of course I've simply copied from my editor, so who knows what character conversions occurred as a result of that, but anyhow:

<question id="BMHPD17">
  <permittedResponseCount>1</permittedResponseCount>
  <text>It’s hard for me to stay out of trouble. (Would you say this is true or false for you?)</text>
  <namedAnswerSet idref="TrueFalse"></namedAnswerSet>
</question>

Note the "smart apostrophe" on line 3.

I am reading this by reacting to a CHARACTERS event, saving its contents to a String on the stack, then reacting to an END_ELEMENT event whose name is "question". Upon receiving the END_ELEMENT event for question, I retrieve the value of the String I just mentioned, and construct a Java object that takes the string I just mentioned as input.
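The reading strategy described above can be sketched roughly as follows. This is an illustration under assumptions, not the asker's code: the class QuestionTextSketch and the method readQuestionTexts are hypothetical names, and the sketch appends every CHARACTERS event inside <text> to a StringBuilder instead of keeping only the most recent one, so it does not depend on whether the parser splits the text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class QuestionTextSketch {
    // Returns the content of <text> for every <question> in the document.
    static List<String> readQuestionTexts(String xml) {
        List<String> texts = new ArrayList<String>();
        StringBuilder current = null; // non-null only while inside <text>
        String lastText = null;       // text of the question being read
        try {
            XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    String name = event.asStartElement().getName().getLocalPart();
                    if ("text".equals(name)) {
                        current = new StringBuilder();
                    }
                } else if (event.isCharacters() && current != null) {
                    // Append rather than overwrite: one text node may arrive in pieces.
                    current.append(event.asCharacters().getData());
                } else if (event.isEndElement()) {
                    String name = event.asEndElement().getName().getLocalPart();
                    if ("text".equals(name) && current != null) {
                        lastText = current.toString();
                        current = null;
                    } else if ("question".equals(name) && lastText != null) {
                        texts.add(lastText);
                        lastText = null;
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return texts;
    }

    public static void main(String[] args) {
        String xml = "<question id=\"BMHPD17\"><text>It\u2019s hard for me.</text>"
            + "<namedAnswerSet idref=\"TrueFalse\"></namedAnswerSet></question>";
        System.out.println(readQuestionTexts(xml).size() + " question(s) read");
    }
}
```

Passing the XML as a String through a StringReader sidesteps byte decoding entirely, which makes it easier to separate parsing problems from encoding problems.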

When I System.out.println() the result, I get (sometimes) the bogus junk I referred to earlier.

When I wrap System.out inside a PrintWriter with "UTF8" encoding set, so that I'm not simply outputting characters according to the platform's encoding, I get the same results.

This turns out to be a bug in the Mac OS X JVM. The character encoding used by the console does not default to UTF-8, even though every other use of the default character encoding is UTF-8.
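One way to tell a console-encoding problem from a parsing problem is to print the string's code points in plain ASCII, which survives whatever encoding the console uses. The sketch below assumes nothing about the Mac JVM; CodePointSketch and describe are names invented for this example.

```java
import java.nio.charset.Charset;

class CodePointSketch {
    // Renders each code point of s as U+XXXX, separated by spaces.
    static String describe(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", codePoint));
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // What the JVM believes the platform default is.
        System.out.println("default charset: " + Charset.defaultCharset());
        // ASCII-only dump of a string containing a smart apostrophe.
        System.out.println(describe("It\u2019s"));
    }
}
```

If the dump shows U+2019 for the smart apostrophe, the parser delivered the right character and the garbling happens at display time.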

Is this even the same as the underlying SAX event, which includes a start offset and length? If so, you will probably find these specify a region of the string that excludes the markup.
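The offset-and-length pattern has a direct counterpart in StAX, but on the cursor API (XMLStreamReader) rather than on XMLEventReader, whose Characters.getData() returns a String. Whether it explains the output in the question is not established; the following is only a sketch of how the region is meant to be read, with TextRegionSketch and collectText as hypothetical names.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class TextRegionSketch {
    // Concatenates all character data in the document.
    static String collectText(String xml) {
        StringBuilder text = new StringBuilder();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    // The returned array may be a shared parser buffer; only the
                    // region [start, start + length) belongs to this event.
                    text.append(reader.getTextCharacters(),
                                reader.getTextStart(),
                                reader.getTextLength());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(collectText("<question><text>Did you sleep a lot?</text></question>"));
    }
}
```

Reading the whole array and ignoring the start and length values is the kind of mistake that would expose neighbouring markup and stale characters from the buffer.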
