Validating XML with DTD fails to import entity using lxml

I have a tool producing NewsML type XML files and I want to validate them after producing the files. I'm receiving an error:

Attempt to load network entity http://www.w3.org/TR/ruby/xhtml-ruby-1.mod

The python call is:

from lxml import etree

parser = etree.XMLParser(load_dtd=True, dtd_validation=True)
treeObject = etree.parse(f, parser)  # f: the generated NewsML file

First, I'm not sure whether I need both load_dtd=True and dtd_validation=True, but I'm using both anyway. Second, the error seems to come from an imported nitf-3-4.dtd, which contains this declaration:

<!ENTITY % xhtml-ruby.mod PUBLIC 
    "-//W3C//ELEMENTS XHTML Ruby 1.0//EN" "http://www.w3.org/TR/ruby/xhtml-ruby-1.mod">
%xhtml-ruby.mod;

Will lxml go out and retrieve this xhtml-ruby-1.mod, or do I have to have all the DTD files locally?


Try constructing the parser with no_network=False. As stated in the documentation:

no_network - prevent network access when looking up external documents (on by default)

lxml should retrieve imported DTD modules, but it cannot do so when network access is not allowed. The restriction applies only to externally referenced documents, not to the document being parsed. In fact, I would expect you to get errors loading the DTD itself, so I assume the document refers to a locally available copy of that DTD, and that only the DTD itself references a remote resource?
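
As a minimal sketch of the suggestion above, the parser from the question can be built with network access enabled. The helper name make_validating_parser and the file name in the comment are made up for illustration; whether the remote fetch then succeeds still depends on connectivity and on the libxml2 build that lxml was compiled against.

```python
from lxml import etree


def make_validating_parser(allow_network=False):
    """Build a DTD-validating parser; allow_network=True lifts lxml's default network block."""
    return etree.XMLParser(
        load_dtd=True,
        dtd_validation=True,
        no_network=not allow_network,  # no_network defaults to True in lxml
    )


parser = make_validating_parser(allow_network=True)
# tree = etree.parse("newsml_output.xml", parser)  # hypothetical file name
```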

You could also use a catalog to point at locally available copies. This not only circumvents the problem, it is also faster and friendlier towards the W3C servers ;-). Libxml2 (used by lxml) checks for a catalog in /etc/xml/catalog and in the files listed in the XML_CATALOG_FILES environment variable (see the Libxml2 docs).
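
A rough sketch of the catalog approach, assuming you have downloaded xhtml-ruby-1.mod yourself: it writes a small OASIS XML catalog that maps the public and system identifiers from the question to a local file, then exports XML_CATALOG_FILES. The helper name write_catalog and the temporary directory are placeholders, and the variable should be set before the first parse (ideally before the process starts), since libxml2 reads it when it first initializes its catalogs.

```python
import os
import tempfile
from pathlib import Path

RUBY_PUBLIC_ID = "-//W3C//ELEMENTS XHTML Ruby 1.0//EN"
RUBY_SYSTEM_ID = "http://www.w3.org/TR/ruby/xhtml-ruby-1.mod"


def write_catalog(catalog_path, local_mod_path):
    """Write a catalog that redirects the Ruby module to a local copy."""
    local_uri = Path(local_mod_path).resolve().as_uri()
    catalog_path = Path(catalog_path)
    catalog_path.write_text(
        '<?xml version="1.0"?>\n'
        '<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">\n'
        f'  <public publicId="{RUBY_PUBLIC_ID}" uri="{local_uri}"/>\n'
        f'  <system systemId="{RUBY_SYSTEM_ID}" uri="{local_uri}"/>\n'
        '</catalog>\n',
        encoding="utf-8",
    )
    return catalog_path


# Placeholder location; a real setup would point at the downloaded module file.
workdir = Path(tempfile.mkdtemp())
catalog = write_catalog(workdir / "catalog.xml", workdir / "xhtml-ruby-1.mod")

# XML_CATALOG_FILES is a space-separated list of catalog files.
os.environ["XML_CATALOG_FILES"] = str(catalog)
```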

(It is also possible to write your own resolvers for lxml to intercept and handle requests, but that would probably be overkill in this case.)
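
For completeness, a custom resolver could look roughly like this. The class name LocalCopyResolver, the example.invalid URL, and the one-line DTD are invented so the snippet is self-contained; a real resolver would more likely return self.resolve_filename(...) pointing at a local copy of the module.

```python
from io import BytesIO

from lxml import etree


class LocalCopyResolver(etree.Resolver):
    """Serve known remote DTD/module URLs from local text instead of the network."""

    def __init__(self, local_copies):
        self.local_copies = local_copies  # maps system URL -> DTD text
        self.requested = []               # URLs lxml asked for, kept for debugging

    def resolve(self, system_url, public_id, context):
        self.requested.append(system_url)
        text = self.local_copies.get(system_url)
        if text is None:
            return None  # defer to the next resolver or lxml's default lookup
        return self.resolve_string(text, context)


DTD_URL = "http://example.invalid/doc.dtd"  # made-up URL for the demo
resolver = LocalCopyResolver({DTD_URL: "<!ELEMENT doc (#PCDATA)>"})

parser = etree.XMLParser(load_dtd=True, dtd_validation=True)  # no_network stays on
parser.resolvers.add(resolver)

xml = f'<!DOCTYPE doc SYSTEM "{DTD_URL}"><doc>hello</doc>'.encode()
root = etree.parse(BytesIO(xml), parser).getroot()
```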

Note that there is also another option besides parse-time validation: use the DTD class to load the DTD separately, and use that as a validator.

This validates the parsed document against the DTD you provide, regardless of which DTD (if any) the doctype declaration references. That can be handy: a file that is valid against the DTD it declares is not necessarily valid against the DTD you care about.

Because the DTD only has to be retrieved and parsed once, this should be faster if you're validating a lot of documents, and (if I'm not mistaken) you won't run into the no_network problem.
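
A small sketch of this approach: the DTD is loaded once and reused for every document. The two-element DTD is a stand-in so the example runs on its own, and the helper name check is made up; in practice you would pass a local DTD file to etree.DTD instead.

```python
from io import StringIO

from lxml import etree

# Stand-in DTD; replace with a local copy of the DTD you actually validate against.
dtd = etree.DTD(StringIO("<!ELEMENT doc (item+)>\n<!ELEMENT item (#PCDATA)>"))


def check(xml_bytes):
    """Return a list of validation error messages (empty if the document is valid)."""
    root = etree.fromstring(xml_bytes)  # plain parse, no DTD loading here
    if dtd.validate(root):
        return []
    return [entry.message for entry in dtd.error_log.filter_from_errors()]


ok_errors = check(b"<doc><item>story</item></doc>")
bad_errors = check(b"<doc/>")  # violates (item+)
```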

Another bonus of this approach: you can even validate your elements/elementtrees before you've serialized them (if your producing tool uses lxml, that is).
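
To illustrate validating before serializing, here is a sketch that builds a tree in memory and only writes it out if it passes. The element names, the stand-in DTD, and the helpers build_document and serialize_if_valid are assumptions for the demo, not part of NewsML.

```python
from io import StringIO

from lxml import etree

# Stand-in DTD so the example is self-contained.
dtd = etree.DTD(StringIO("<!ELEMENT doc (item+)>\n<!ELEMENT item (#PCDATA)>"))


def build_document(headlines):
    """Build the tree in memory, as a producing tool might."""
    root = etree.Element("doc")
    for headline in headlines:
        etree.SubElement(root, "item").text = headline
    return root


def serialize_if_valid(root):
    """Serialize only after the in-memory tree passes DTD validation."""
    dtd.assertValid(root)  # raises etree.DocumentInvalid on failure
    return etree.tostring(root, xml_declaration=True, encoding="UTF-8")


output = serialize_if_valid(build_document(["First story", "Second story"]))
```

Using assertValid means an invalid tree never reaches disk; dtd.validate(root) is the alternative if you prefer a boolean result.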

A final note: some documents can only be parsed if you have access to the DTD at parse time (for example, when they use entities that only the DTD defines). Avoid this if you can. (And, although not everyone would agree: avoid doctype declarations altogether if possible.)
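
A quick illustration of the entity problem, using a made-up document: a named entity that only a DTD would define makes the parse fail when no DTD is available, while a numeric character reference works without one. The helper name try_parse is hypothetical.

```python
from lxml import etree


def try_parse(xml_bytes):
    """Return (root, None) on success or (None, error message) on a syntax error."""
    try:
        return etree.fromstring(xml_bytes), None
    except etree.XMLSyntaxError as exc:
        return None, str(exc)


# &eacute; is not predefined in XML; without a DTD it cannot be resolved.
root, error = try_parse(b"<doc>caf&eacute;</doc>")

# A numeric character reference needs no DTD at all.
root2, error2 = try_parse(b"<doc>caf&#233;</doc>")
```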
