Random queries on a large XML file
I have a large XML file (1 GB). I need to run many queries against it (using XPath, for example). The results are small parts of the XML. I want the queries to be as fast as possible, but the 1 GB file is probably too large for working memory.
The XML looks something like this:
<all>
<record>
<id>1</id>
... lots of fields (very different fields per record, sometimes including subrecords,
so mapping onto a relational database would be hard).
</record>
<record>
<id>2</id>
... lots of fields.
</record>
.. lots and lots and lots of records
</all>
I need random access, selecting records using, for instance, <id> as a key. (The id is the most important key, but other fields might be used as keys too.) I don't know the queries in advance; they arrive and have to be executed as soon as possible, in real time rather than in batches. SAX does not look very promising because I don't want to reread the entire file for every query. But DOM doesn't look very promising either, because the file is very large, and the additional structural overhead will almost certainly mean that it does not fit in working memory.
Which Java library or approach would be best for this problem?
When handling XML you generally have two approaches: streaming (SAX) or loading the entire document into memory (various DOM implementations).
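As a rough illustration of the streaming side, here is a minimal SAX sketch using only the parser bundled with the JDK. The class name `SaxIdScan` and the method `findIds` are made up for this example, and it assumes the layout from the question: `<id>` is a direct child of `<record>`, which is a direct child of `<all>`. It reports which of the wanted ids occur in the document without ever holding more than one id in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: one streaming pass that reports which wanted ids occur. */
class SaxIdScan {
    static List<String> findIds(InputSource xml, Set<String> wanted) throws Exception {
        List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0;      // 1 = <all>, 2 = <record>, 3 = fields of a record
            private StringBuilder text; // non-null only while inside a record-level <id>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                depth++;
                if (depth == 3 && "id".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver text in several chunks, so accumulate it.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (text != null && depth == 3) {
                    String id = text.toString().trim();
                    if (wanted.contains(id)) {
                        found.add(id);
                    }
                    text = null;
                }
                depth--;
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        return found;
    }
}
```

Every call reads the whole document from start to finish, which is the cost the question is trying to avoid for ad hoc queries.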
If you can pre-establish a set of queries to be processed in bulk, you could write a program to use SAX to stream the file, looking for matches. If the queries come in at random intervals (i.e. a typical database application) then you will need to either load the entire document into memory, or preprocess the XML document into a database of some kind.
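One lightweight form of preprocessing, sketched here under several assumptions, is to scan the file once and remember the byte range of each record, then seek straight to a record when a query arrives. The class `RecordOffsetIndex` and its `lookup` method are hypothetical names. The sketch assumes a UTF-8 (or other ASCII-compatible) file, `<record>` and `<id>` tags written without attributes exactly as in the question, no nested element that is itself called `<record>`, no comments or CDATA sections containing those tag strings, and that a record's own `<id>` comes before any subrecord id.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: maps each record id to the byte span of its <record> element. */
class RecordOffsetIndex {
    private static final byte[] OPEN = "<record>".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] CLOSE = "</record>".getBytes(StandardCharsets.US_ASCII);

    private final String path;
    private final Map<String, long[]> spans = new HashMap<>(); // id -> {start, length}

    RecordOffsetIndex(String path) throws IOException {
        this.path = path;
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream(); // bytes of the current record
            long pos = 0;     // number of bytes consumed so far
            long start = -1;  // offset of the current <record>, or -1 when outside a record
            int openMatched = 0;
            int closeMatched = 0;
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (start >= 0) {
                    buf.write(b);
                }
                openMatched = advance(OPEN, openMatched, b);
                closeMatched = advance(CLOSE, closeMatched, b);
                if (openMatched == OPEN.length) {
                    openMatched = 0;
                    if (start < 0) {
                        start = pos - OPEN.length;
                        buf.reset();
                        buf.write(OPEN, 0, OPEN.length);
                    }
                }
                if (closeMatched == CLOSE.length) {
                    closeMatched = 0;
                    if (start >= 0) {
                        String rec = new String(buf.toByteArray(), StandardCharsets.UTF_8);
                        int a = rec.indexOf("<id>");
                        int z = a < 0 ? -1 : rec.indexOf("</id>", a);
                        if (z > a) {
                            spans.put(rec.substring(a + 4, z).trim(), new long[] {start, pos - start});
                        }
                        start = -1;
                    }
                }
            }
        }
    }

    // Naive matcher; sufficient because '<' occurs only at the start of each pattern.
    private static int advance(byte[] pattern, int matched, int b) {
        if (b == pattern[matched]) {
            return matched + 1;
        }
        return b == pattern[0] ? 1 : 0;
    }

    /** Returns the raw XML text of the record with this id, or null if the id is unknown. */
    String lookup(String id) throws IOException {
        long[] span = spans.get(id);
        if (span == null) {
            return null;
        }
        byte[] bytes = new byte[(int) span[1]];
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(span[0]);
            raf.readFully(bytes);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

The index holds only an id and two numbers per record, so it should stay far smaller than the document itself. The fragment returned by `lookup` is small enough to hand to a normal DOM or XPath parser. Lookups on other key fields would need one such map per field, and a real XML database does this kind of indexing far more robustly.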
A better description of what you're trying to accomplish might help get better answers.
VTD-XML is the best fit for your use case: http://vtd-xml.sourceforge.net/
Piccolo is a small, extremely fast XML parser for Java. It implements the SAX 1, SAX 2.0.1, and JAXP 1.1 (SAX parsing only) interfaces as a non-validating parser. It is available under the Apache License.
Depending on the application, an XML-oriented database such as eXist (http://exist.sourceforge.net/) could be worth a look.