Obtain XML entity replacement text from DOM in Xerces

2023-03-08 14:36 Q&A author:

The Javadoc for org.w3c.dom.Entity states:

XML does not mandate that a non-validating XML processor read and process entity declarations made in the external subset or declared in parameter entities. This means that parsed entities declared in the external subset need not be expanded by some classes of applications, and that the replacement text of the entity may not be available. When the replacement text is available, the corresponding Entity node's child list represents the structure of that replacement value. Otherwise, the child list is empty.

Whilst it does not refer to entity declarations made in the internal subset, there must surely be some configuration of parser which will read and process entity declarations in either subset? Indeed, my reading of the documentation would suggest that this is the default.

In any event, I have tested the following approach (using Xerces) against entities which have been declared in the internal subset (as shown) and also in an external subset, but foo.hasChildNodes() returns false (and foo.getChildNodes() returns foo!) in every case:

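import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Entity;

// some trivial example XML: foo is declared in the internal subset but never referenced
String xml = "<!DOCTYPE example [ <!ENTITY foo 'bar'> ]>\n<example/>";
// encode explicitly rather than relying on the platform default charset
InputStream is = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));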

// parse
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
DocumentType docType = builder.parse(is).getDoctype();

// retrieve the entity - works fine
Entity foo = (Entity) docType.getEntities().getNamedItem("foo");

// now how to get the entity's replacement text?

No doubt I am missing something rather obvious; grateful for your thoughts.

EDIT

It appears from the answers so far that my Xerces implementation is misbehaving. I will try to update all Xerces libraries to latest versions and, if that solves my problem, I will close off the question. Many thanks.

UPDATE

Updating Xerces has indeed solved the problem, provided that the entity is referenced from within the document; if it is not, then the node still has no children. It is not entirely clear to me why this should be the case. Grateful if someone could explain what's going on and/or point me to how I can force the creation of the child nodes without explicitly referencing every entity from within the document.

I think you may be mistaken about how the replacement text works. Based on some reading (http://www.javacommerce.com/displaypage.jsp?name=entities.sql&id=18238), it looks to me as though an entity works like a variable: its replacement text is substituted wherever the entity is referenced. In your example above you never reference the &foo; entity. If you run the code sample below, you will see that &foo; is replaced with the string bar:

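import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Entity;

// some trivial example XML: this time &foo; is referenced in the document body
String xml = "<!DOCTYPE example [ <!ENTITY foo 'bar'> ]><example><foo>&foo;</foo></example>";
// encode explicitly rather than relying on the platform default charset
InputStream is = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));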

// parse
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(is);
DocumentType docType = doc.getDoctype();

// retrieve the entity - works fine
Entity foo = (Entity) docType.getEntities().getNamedItem("foo");
for(int i = 0; i < foo.getChildNodes().getLength(); i++) {
  System.out.println(foo.getChildNodes().item(i));
}

What you see printed is [#text: bar], which is the entity's replacement text.

I may be wrong, but I think Entity nodes store the replacement text as a text value rather than as a set of nodes. Entities are not fully parsed when their declarations are read, mostly because the DTD handler is a sort of pre-processor that runs before the actual parsing. So check the text value of the Entity node instead of its child node list.

I don't know why foo.getChildNodes() doesn't work, but I discovered the following. If the entity is used (referenced) in the document,

<!DOCTYPE example [<!ENTITY foo 'bar'>]>\n<example>&foo;</example>,

then the replacement text is available via

foo.getTextContent()

I asked on the Xerces-J Users mailing list about the non-existence of child nodes where the entities are not referenced within the document; there Michael Glavassevich helpfully pointed me towards an old post from Andy Clark explaining as follows:

Unfortunately (for you) this is a feature. And it was implemented this way mainly for performance reasons. If an entity is never referenced in the document, then we never have to waste time reading it. If the external entity is huge but never referenced, we don't waste time or memory.

Plus, there is a deeper problem in relation to namespaces. DOM can't even help. I'll explain...

Take the following document and external entity:

  entity.ent:
  <hello/>

  The document:
  <!DOCTYPE root [
  <!ENTITY entity SYSTEM 'entity.ent'>
  ]>
  <root>
    <sub xmlns='foo'> &entity; </sub>
    <sub xmlns='bar'> &entity; </sub>
  </root>

Notice that the default namespace is different at each point where the entity is referenced. This means that the <hello/> element will be bound to a different namespace in each case, so the two expansions of the same entity are actually different elements!

In this situation, what should the Entity node in the DOM doctype return: children in the "foo" namespace or children in the "bar" namespace?

In short, it's a complicated issue.

You might be best off trying to read the document fragment yourself when you look for the Entity node and it has no children. Xerces has a document fragment scanner in the impl package that would be useful for this purpose. You'd have to write code that builds children for a DOM document fragment from XNI methods, though. But this isn't hard to do. I can point you to an example if you need it.
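
If all you need is the declared replacement text of internal entities (rather than a DOM subtree), one alternative to the XNI route described above is the standard SAX DeclHandler callback, which is invoked for each entity declaration whether or not the entity is ever referenced. This is a different technique from the document fragment scanner Andy Clark mentions, and it does not populate the DOM Entity node's children. The sketch below assumes the parser supports the standard declaration-handler property; the class name EntityDeclarations and the method internalEntities are made up for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

public class EntityDeclarations {

    /** Hypothetical helper: maps each internal general entity name to its replacement text. */
    static Map<String, String> internalEntities(String xml) throws Exception {
        Map<String, String> entities = new LinkedHashMap<>();

        // DefaultHandler2 implements DeclHandler (and ContentHandler) with no-op methods
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void internalEntityDecl(String name, String value) {
                // parameter entities are reported with a leading '%'; skip them here
                if (!name.startsWith("%")) {
                    // the first declaration of a name is the binding one in XML
                    entities.putIfAbsent(name, value);
                }
            }
        };

        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // standard SAX2 extension property for receiving DTD declaration events
        reader.setProperty("http://xml.org/sax/properties/declaration-handler", handler);
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return entities;
    }

    public static void main(String[] args) throws Exception {
        // same XML as in the question: foo is declared but never referenced
        String xml = "<!DOCTYPE example [ <!ENTITY foo 'bar'> ]>\n<example/>";
        System.out.println(internalEntities(xml).get("foo"));
    }
}
```

External entities are reported through a separate callback (externalEntityDecl) that supplies only the public and system identifiers, so their content would still have to be read and parsed separately.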
