Parse Ampersand in XML with Java's DOM XML API

I am trying to parse an XML document with the Java DOM API (not SAX). Whenever the parser encounters an ampersand (&) while parsing a text node, it errors out. I am guessing this is solvable by 1) escaping, 2) encoding, or 3) using a different parser.

I am reading an XML document that I don't have any control over, so I cannot know in advance where the ampersand will appear each time I read it.

The answers I have seen to similar questions have advised replacing the entity type when parsing the XML, but I am not sure how I would do that, since the document doesn't even parse once the parser encounters the ampersand.

Any help will be appreciated.


As noted, the XML is malformed (oops!): every occurrence of & in XML, other than one that introduces an entity reference or character reference, must be encoded as &amp;.

Some solutions (which are basically just as described in the post!):

  1. Fix the XML (at source, or in hack-it-up phase), or;
  2. Parse it with the "appropriate" tool (e.g. a "forgiving" HTML parser)

For the "hack-it-up" approach, consider a separate input stream (see Working with Filter Streams) that runs as a filter in front of the actual DOM parser: whenever a & is encountered that is not part of an entity or character reference, it "fixes it" by writing &amp; into the stream instead. Of course, if the XML source didn't get basic encoding correct...
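As a rough sketch of that idea, the snippet below does the fix-up on a String rather than in a true filter stream, which is simpler but only suitable when the document fits in memory. The class and method names (`AmpersandFixer`, `escapeBareAmpersands`, `parseLenient`) are made up for illustration. It assumes that anything shaped like `&name;`, `&#123;` or `&#x1F;` is an intentional reference and leaves it alone, so undeclared names such as `&nbsp;` will still fail, and ampersands inside CDATA sections or comments would be escaped as well.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class AmpersandFixer {
    // Matches '&' that does NOT start something shaped like a named entity
    // (&amp;), a decimal reference (&#38;) or a hex reference (&#x26;).
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    /** Replaces every bare '&' with "&amp;" and leaves existing references untouched. */
    public static String escapeBareAmpersands(String xml) {
        return BARE_AMPERSAND.matcher(xml).replaceAll("&amp;");
    }

    /** Repairs the input first, then hands it to the ordinary DOM parser. */
    public static Document parseLenient(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(escapeBareAmpersands(xml))));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseLenient("<shop><name>Fish & Chips &amp; more</name></shop>");
        // Prints: Fish & Chips & more
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Running the repair as a separate step keeps the DOM parser itself strict, so any other well-formedness problem in the input still surfaces as a normal parse error.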

Happy coding.


"I am reading an XML document that I don't have any control over".

No, you are reading a non-XML document. The reason you get an error is that XML parsers are required to give you an error when you read something that isn't XML.
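To see this behaviour directly, here is a minimal sketch that feeds a string to the standard JAXP DOM parser and returns the fatal error it raises, if any. The class name `WellFormednessCheck` and the helper `firstFatalError` are invented for this example. The line and column carried by the `SAXParseException` are at least useful for reporting the offending position back to whoever produces the data.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class WellFormednessCheck {
    /** Returns null if the input parses, otherwise the parser's fatal error. */
    public static SAXParseException firstFatalError(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return e; // carries the position where parsing stopped
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParseException error = firstFatalError("<name>Fish & Chips</name>");
        if (error != null) {
            System.out.println("Not well-formed at line " + error.getLineNumber()
                    + ", column " + error.getColumnNumber() + ": " + error.getMessage());
        }
    }
}
```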

The XML culture is that responsibility for producing well-formed XML rests with the sender. You need to change whatever produces this data to do it properly. Otherwise, you might as well forget XML and its benefits, and move back to the chaotic world of privately-agreed protocols and custom parsers.
