How to use ANTLR to parse an XML document
Can anybody tell me how to use the ANTLR tool (in Java) to create our own grammar for XML documents, and how to parse those documents using ANTLR (in Java)?
Check out ANTXR, my ANTLR derivation that supports XML tags in the grammar itself. You can use SAX or XMLPull as a front end. (Note: it's based on ANTLR 2.x)
http://javadude.com/tools/antxr/index.html
Short example:
header {
    package com.javadude.antlr.sample.xml;
    import java.util.List;
    import java.util.ArrayList;
}

class PeopleParser extends Parser;

document returns [List results = null]
    : results=<people> EOF
    ;

<people> returns [List results = new ArrayList()]
    { Person p; }
    : ( p=<person> { results.add(p); } )*
    ;

<person> returns [Person p = new Person()]
    {
        String first, last;
        p.setId(@id); // attributes are read using "@xxxx"
    }
    : ( first=<firstName> { p.setFirstName(first); }
      | last=<lastName> { p.setLastName(last); }
      )*
    ;

<firstName> returns [String value = null]
    : pcdata:PCDATA { value = pcdata.getText(); }
    ;

<lastName> returns [String value = null]
    : pcdata:PCDATA { value = pcdata.getText(); }
    ;
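For comparison, here is a rough sketch of the same extraction done with plain SAX from the JDK, with no ANTXR involved. It only illustrates the kind of event stream (start tag with attributes, character data, end tag) that a SAX front end delivers; how ANTXR itself is wired to SAX is not shown here. The class name PeopleSaxDemo, the "id:first last" output format and the sample data are made up for illustration; the element and attribute names come from the grammar above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: plain JDK SAX, not ANTXR.
public class PeopleSaxDemo {

    /** Parses a <people> document and returns one "id:first last" string per <person>. */
    public static List<String> parse(String xml) throws Exception {
        final List<String> people = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private String id;
            private String first;
            private String last;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                text.setLength(0); // start collecting character data afresh
                if ("person".equals(qName)) {
                    id = atts.getValue("id"); // the "@id" of the grammar above
                    first = null;
                    last = null;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver text in several chunks, so accumulate it.
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                switch (qName) {
                    case "firstName": first = text.toString().trim(); break;
                    case "lastName":  last = text.toString().trim();  break;
                    case "person":    people.add(id + ":" + first + " " + last); break;
                    default: break;
                }
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return people;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input matching the element names used in the grammar.
        String xml = "<people>"
                + "<person id=\"1\"><firstName>Ada</firstName><lastName>Lovelace</lastName></person>"
                + "<person id=\"2\"><firstName>Alan</firstName><lastName>Turing</lastName></person>"
                + "</people>";
        for (String person : parse(xml)) {
            System.out.println(person);
        }
    }
}
```

The hand-written handler has to track state (current id, current text) explicitly; the grammar above expresses the same nesting declaratively, which is the main appeal of putting the tags in the grammar.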
If you want to write a completely conforming (even non-validating) XML parser, you must read the W3C specification (http://www.w3.org/TR/REC-xml/). You will need to deal with internal and external DTD subsets, parameter entities and general entities. This will be a major task, even with ANTLR. You will need to be able to resolve URLs and deal with namespace URIs. And a lot more.
I suspect that you wish to parse only a subset (though I don't think it's a good idea to write non-conformant parsers for standards). In which case the first thing is to write the EBNF for your subset. Then it should be fairly straightforward :-)
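To give an idea of what "write the EBNF, then implement it" can look like, here is a minimal sketch in plain Java (hand-written recursive descent rather than ANTLR). The subset, the class name TinyMarkupParser and the Node type are all assumptions made up for this illustration. It handles only elements, double-quoted attributes and text: no entities, comments, CDATA, processing instructions, DTDs or namespaces, so it is not an XML parser.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/*
 * Hypothetical XML-like subset (NOT conforming XML):
 *   element   ::= '<' Name attribute* S? ( '/>' | '>' content '</' Name S? '>' )
 *   attribute ::= S Name '=' '"' [^"]* '"'
 *   content   ::= ( element | text )*
 *   text      ::= [^<]+
 *   Name      ::= (Letter | '_') (Letter | Digit | '_' | '-' | '.')*
 */
public class TinyMarkupParser {
    public static final class Node {
        public final String name;
        public final Map<String, String> attributes = new LinkedHashMap<>();
        public final List<Node> children = new ArrayList<>();
        public final StringBuilder text = new StringBuilder();
        Node(String name) { this.name = name; }
    }

    private final String src;
    private int pos = 0;

    private TinyMarkupParser(String src) { this.src = src; }

    /** Parses exactly one root element, allowing surrounding whitespace. */
    public static Node parse(String src) {
        TinyMarkupParser p = new TinyMarkupParser(src);
        p.skipSpace();
        Node root = p.element();
        p.skipSpace();
        if (p.pos != src.length()) throw p.error("trailing content");
        return root;
    }

    private Node element() {
        expect("<");
        Node node = new Node(name());
        while (true) {                       // attribute*
            boolean hadSpace = skipSpace();
            if (lookingAt("/>")) { pos += 2; return node; }
            if (lookingAt(">")) { pos++; break; }
            if (!hadSpace) throw error("expected whitespace before attribute");
            String attr = name();
            expect("=");
            expect("\"");
            int end = src.indexOf('"', pos);
            if (end < 0) throw error("unterminated attribute value");
            if (node.attributes.put(attr, src.substring(pos, end)) != null)
                throw error("duplicate attribute " + attr);
            pos = end + 1;
        }
        content(node);
        expect("</");
        String close = name();
        if (!close.equals(node.name)) throw error("mismatched end tag " + close);
        skipSpace();
        expect(">");
        return node;
    }

    private void content(Node parent) {
        while (pos < src.length() && !lookingAt("</")) {
            if (lookingAt("<")) {
                parent.children.add(element());
            } else {                         // text: everything up to the next '<'
                int end = src.indexOf('<', pos);
                if (end < 0) end = src.length();
                parent.text.append(src, pos, end);
                pos = end;
            }
        }
    }

    private String name() {
        int start = pos;
        if (pos < src.length() && (Character.isLetter(src.charAt(pos)) || src.charAt(pos) == '_')) {
            pos++;
            while (pos < src.length() && (Character.isLetterOrDigit(src.charAt(pos))
                    || "_-.".indexOf(src.charAt(pos)) >= 0)) pos++;
        }
        if (pos == start) throw error("expected a name");
        return src.substring(start, pos);
    }

    private boolean skipSpace() {
        int start = pos;
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
        return pos > start;
    }

    private boolean lookingAt(String s) { return src.startsWith(s, pos); }

    private void expect(String s) {
        if (!lookingAt(s)) throw error("expected '" + s + "'");
        pos += s.length();
    }

    private IllegalArgumentException error(String msg) {
        return new IllegalArgumentException(msg + " at offset " + pos);
    }
}
```

Calling TinyMarkupParser.parse("<people><person id=\"1\"/></people>") returns a Node named "people" with one child. Each grammar rule maps to one method, which is what makes the subset approach straightforward; everything the grammar leaves out is also exactly what real XML documents will trip over.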
EDIT: To make it very clear: anything that does not conform to the complete spec is NOT XML. You talk about creating your "own grammar" for XML, but there is already a defined grammar for XML, which cannot be modified. If you wish to create your own syntax which is "like XML" you can, but anyone who thinks it actually IS XML will be disappointed, as there are many XML constructs you won't support (or will support differently).