
Obtain XML entity replacement text from DOM in Xerces

The Javadoc for org.w3c.dom.Entity states:

XML does not mandate that a non-validating XML processor read and process entity declarations made in the external subset or declared in parameter entities. This means that parsed entities declared in the external subset need not be expanded by some classes of applications, and that the replacement text of the entity may not be available. When the replacement text is available, the corresponding Entity node's child list represents the structure of that replacement value. Otherwise, the child list is empty.

Whilst it does not refer to entity declarations made in the internal subset, there must surely be some configuration of parser which will read and process entity declarations in either subset? Indeed, my reading of the documentation would suggest that this is the default.

In any event, I have tested the following approach (using Xerces) against entities which have been declared in the internal subset (as shown) and also in an external subset, but foo.hasChildNodes() returns false (and foo.getChildNodes() returns foo!) in every case:

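import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Entity;

// some trivial example XML
String xml = "<!DOCTYPE example [ <!ENTITY foo 'bar'> ]>\n<example/>";
InputStream is = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));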

// parse
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
DocumentType docType = builder.parse(is).getDoctype();

// retrieve the entity - works fine
Entity foo = (Entity) docType.getEntities().getNamedItem("foo");

// now how to get the entity's replacement text?

No doubt I am missing something rather obvious; grateful for your thoughts.


EDIT

It appears from the answers so far that my Xerces implementation is misbehaving. I will try to update all Xerces libraries to latest versions and, if that solves my problem, I will close off the question. Many thanks.


UPDATE

Updating Xerces has indeed solved the problem, provided that the entity is referenced from within the document; if it is not, then the node still has no children. It is not entirely clear to me why this should be the case. Grateful if someone could explain what's going on and/or point me to how I can force the creation of the child nodes without explicitly referencing every entity from within the document.


I think you may be mistaken about how the replacement text works. Based on some reading (http://www.javacommerce.com/displaypage.jsp?name=entities.sql&id=18238), it looks to me like an entity works like a variable: the replacement text is substituted wherever the entity is referenced. In your example above you never reference the &foo; entity. If you run the code sample below, you will see that &foo; gets replaced with the string bar:

// some trivial example XML
String xml = "<!DOCTYPE example [ <!ENTITY foo 'bar'> ]><example><foo>&foo;</foo></example>";
InputStream is = new ByteArrayInputStream(xml.getBytes());

// parse
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(is);
DocumentType docType = doc.getDoctype();

// retrieve the entity and print each child node of its replacement value
Entity foo = (Entity) docType.getEntities().getNamedItem("foo");
org.w3c.dom.NodeList children = foo.getChildNodes();
for (int i = 0; i < children.getLength(); i++) {
  System.out.println(children.item(i));
}

What you see printed is [#text: bar], which is the entity's replacement text from the XML.


I may be wrong, but I think Entity nodes store the replacement text as a text value and not as a set of nodes. Entities are not fully parsed when their declarations are read, mostly because the DTD handler is a sort of pre-processor that runs before the actual parsing process. So check the text value of the Entity node instead of its list of child nodes.


I don't know why foo.getChildNodes() doesn't work, but I discovered the following. If the entity is used (referenced) in the document,

<!DOCTYPE example [<!ENTITY foo 'bar'>]>\n<example>&foo;</example>,

then the replacement text is available via

foo.getTextContent()
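
For anyone who wants to try this, here is a minimal sketch that compares the referenced and unreferenced cases side by side. The class and method names (ReferencedEntityText, documentText, entityText) are made up for illustration, and it assumes whatever DOM parser JAXP picks up by default. What the Entity node reports will depend on your parser version, as noted in the question's UPDATE, so the sketch prints the values without promising what they are.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Entity;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of Xerces or the original posts.
public class ReferencedEntityText {

    /** Parses an XML string with the default JAXP DOM parser. */
    public static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Text of the document element, with entity references expanded by the parser. */
    public static String documentText(String xml) {
        return parse(xml).getDocumentElement().getTextContent();
    }

    /** Text content of the named Entity node, or null if no such entity is declared. */
    public static String entityText(String xml, String name) {
        DocumentType docType = parse(xml).getDoctype();
        if (docType == null) {
            return null;
        }
        Entity entity = (Entity) docType.getEntities().getNamedItem(name);
        return entity == null ? null : entity.getTextContent();
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE example [<!ENTITY foo 'bar'>]>";
        String referenced = dtd + "<example>&foo;</example>";
        String unreferenced = dtd + "<example/>";

        System.out.println("document text:         " + documentText(referenced));
        System.out.println("entity (referenced):   " + entityText(referenced, "foo"));
        System.out.println("entity (unreferenced): " + entityText(unreferenced, "foo"));
    }
}
```

If the last two lines differ on your setup, you are seeing the behaviour described in the question's UPDATE.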


I asked on the Xerces-J Users mailing list about the non-existence of child nodes where the entities are not referenced within the document; there Michael Glavassevich helpfully pointed me towards an old post from Andy Clark explaining as follows:

Unfortunately (for you) this is a feature. And it was implemented this way mainly for performance reasons. If an entity is never referenced in the document, then we never have to waste time reading it. If the external entity is huge but never referenced, we don't waste time or memory.

Plus, there is a deeper problem in relation to namespaces. DOM can't even help. I'll explain...

Take the following document and external entity:

  <!-- entity.ent -->
  <hello/>

  <!-- document.xml -->
  <!DOCTYPE root [
  <!ENTITY entity SYSTEM 'entity.ent'>
  ]>
  <root>
    <sub xmlns='foo'> &entity; </sub>
    <sub xmlns='bar'> &entity; </sub>
  </root>

Notice that the default namespace is different at each point where the entity is referenced. This means that the element will be bound to different namespaces. So both instances of the same entity are actually different elements!

In this situation, what should the Entity node in the DOM doctype return: children in the "foo" namespace or children in the "bar" namespace?

In short, it's a complicated issue.
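
(An aside that is not part of the quoted post.) To make the namespace point concrete, here is a small sketch. It assumes an internal entity whose replacement text is <hello/> rather than the external entity.ent, so that it is self-contained, and the class and method names are invented for illustration. With a namespace-aware parser, the two expansions of the same entity should come back bound to different namespaces, foo for the first and bar for the second.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of Xerces or the quoted post.
public class EntityNamespaceDemo {

    /** Returns the namespace URI of every element with local name "hello", in document order. */
    public static List<String> helloNamespaces(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP factories are not namespace-aware by default
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            // "*" matches elements in any namespace.
            NodeList hellos = doc.getElementsByTagNameNS("*", "hello");
            List<String> namespaces = new ArrayList<>();
            for (int i = 0; i < hellos.getLength(); i++) {
                namespaces.add(hellos.item(i).getNamespaceURI());
            }
            return namespaces;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml =
                "<!DOCTYPE root [<!ENTITY entity '<hello/>'>]>"
              + "<root>"
              + "<sub xmlns='foo'>&entity;</sub>"
              + "<sub xmlns='bar'>&entity;</sub>"
              + "</root>";
        System.out.println(helloNamespaces(xml));
    }
}
```

The same declaration yields two differently bound elements, which is why there is no single obviously correct child list for the Entity node.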

You might be best off trying to read the document fragment yourself when you look for the Entity node and it has no children. Xerces has a document fragment scanner in the impl package that would be useful for this purpose. You'd have to write code that builds children for a DOM document fragment from XNI methods, though. But this isn't hard to do. I can point you to an example if you need it.
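
If all you need is the replacement text of internal entities, and not a DOM subtree, one alternative that was not suggested in the thread is to skip the DOM and listen for the declarations through SAX. The sketch below is only a rough illustration under that assumption. It registers an org.xml.sax.ext.DeclHandler through the standard declaration-handler property, and the class name EntityDeclCollector and its method are hypothetical. External entities are reported through a different callback (externalEntityDecl) and are not handled here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper class, not part of Xerces or the original posts.
public class EntityDeclCollector {

    /** Collects name -> replacement text for internal general entities declared in the DTD. */
    public static Map<String, String> internalEntities(String xml) {
        Map<String, String> entities = new LinkedHashMap<>();

        // DefaultHandler2 implements DeclHandler with no-op methods; override the one we need.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void internalEntityDecl(String name, String value) {
                // Parameter entities are reported with a leading '%'; skip them here.
                if (!name.startsWith("%")) {
                    entities.putIfAbsent(name, value);
                }
            }
        };

        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // Standard SAX2 extension property for receiving DTD declarations.
            reader.setProperty("http://xml.org/sax/properties/declaration-handler", handler);
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return entities;
    }

    public static void main(String[] args) {
        // Same document as in the question: foo is declared but never referenced.
        String xml = "<!DOCTYPE example [ <!ENTITY foo 'bar'> ]>\n<example/>";
        System.out.println(internalEntities(xml));
    }
}
```

The callback fires when the declaration is read, so it does not depend on the entity being referenced in the document. The trade-off is that you get the replacement text as a plain string, not as parsed nodes.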
