Java: How to split XML stream into small XML documents? XPath on streaming XML parser?
I need to read a large XML document from the network and split it up into smaller XML documents. In particular, the stream I read from the network looks something like this:
<a>
<b>
...
</b>
<b>
...
</b>
<b>
...
</b>
<b>
...
</b>
....
</a>
I need to break this up into chunks of
<a> <b> ... </b> </a>
(I actually only need the <b> ... </b> parts, as long as the namespace bindings declared higher up (e.g. in <a>) are moved to <b>, if that makes it easier.)
The file is too big for a DOM-style parser, so it has to be processed as a stream. Is there any XML library that can do this?
[Edit]
I think what I'm ideally looking for is the ability to run XPath queries on an XML stream, where the stream parser only parses as far as necessary to return the next item in the result node set (with all its attributes and children). It doesn't have to be XPath, but something along those lines.
Thanks!
The JAXP SAX API with a SAX filter is both fast and efficient. A good introduction to filters can be seen here
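A minimal sketch of the SAX approach, using only the JDK. For brevity it uses a plain DefaultHandler rather than a full XMLFilter chain, and it serializes each chunk through an identity TransformerHandler. The class name SaxSplitter, the split() method and the choice to collect chunks in a list are all made up for illustration; it also assumes the JDK's default TransformerFactory, which supports the SAX extensions. Namespace bindings declared on ancestors such as <a> are replayed at the start of each chunk, so they should end up on <b>.

```java
import java.io.Reader;
import java.io.StringWriter;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxSplitter extends DefaultHandler {
    private final String chunkName; // local name of the root's children to split on
    private final List<String> chunks = new ArrayList<>();
    private final Map<String, Deque<String>> outer = new HashMap<>(); // bindings declared outside a chunk
    private final SAXTransformerFactory tf =
            (SAXTransformerFactory) TransformerFactory.newInstance();
    private TransformerHandler out; // non-null only while inside a chunk
    private StringWriter buffer;
    private int depth;

    SaxSplitter(String chunkName) { this.chunkName = chunkName; }

    public static List<String> split(Reader xml, String chunkName) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true); // needed to receive prefix-mapping events
        SaxSplitter handler = new SaxSplitter(chunkName);
        spf.newSAXParser().parse(new InputSource(xml), handler);
        return handler.chunks;
    }

    @Override
    public void startPrefixMapping(String prefix, String uri) throws SAXException {
        if (out != null) out.startPrefixMapping(prefix, uri);
        else outer.computeIfAbsent(prefix, p -> new ArrayDeque<>()).push(uri);
    }

    @Override
    public void endPrefixMapping(String prefix) throws SAXException {
        if (out != null) { out.endPrefixMapping(prefix); return; }
        Deque<String> uris = outer.get(prefix);
        if (uris != null) { uris.pop(); if (uris.isEmpty()) outer.remove(prefix); }
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        // A chunk starts at a direct child of the root with the requested name.
        if (out == null && depth == 1 && chunkName.equals(local)) beginChunk();
        if (out != null) out.startElement(uri, local, qName, atts);
        depth++;
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        depth--;
        if (out == null) return;
        out.endElement(uri, local, qName);
        if (depth == 1) finishChunk();
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (out != null) out.characters(ch, start, length);
    }

    private void beginChunk() throws SAXException {
        try {
            out = tf.newTransformerHandler(); // identity transform used as a serializer
        } catch (TransformerConfigurationException e) {
            throw new SAXException(e);
        }
        buffer = new StringWriter();
        out.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        out.setResult(new StreamResult(buffer));
        out.startDocument();
        // Replay bindings inherited from ancestors so the chunk stands alone.
        for (Map.Entry<String, Deque<String>> e : outer.entrySet()) {
            out.startPrefixMapping(e.getKey(), e.getValue().peek());
        }
    }

    private void finishChunk() throws SAXException {
        for (String prefix : outer.keySet()) out.endPrefixMapping(prefix);
        out.endDocument();
        chunks.add(buffer.toString()); // a real splitter would hand the chunk off here
        out = null;
    }
}
```

Calling SaxSplitter.split(new StringReader("<a><b>text1</b><b>text2</b></a>"), "b") should return two strings, one per <b> element. Only one chunk is buffered at a time during parsing, so for a truly large input you would replace the list with a callback or write each chunk straight to its own file.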
As an XML splitter, VTD-XML is ideally suited for this task; it is also more memory-efficient than DOM. The key method that simplifies the coding is VTDNav's getElementFragment(). Below is the Java code that splits input.xml into out0.xml and out1.xml, turning
<a> <b> text1 </b> <b> text2 </b> </a>
into
<a> <b> text1</b> </a>
and
<a> <b> text2</b> </a>
using the XPath expression
/a/b
The code:
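import java.io.*;
import com.ximpleware.*;

public class Split {
    public static void main(String[] argv) throws Exception {
        VTDGen vg = new VTDGen();
        // The second argument turns on namespace awareness.
        if (vg.parseFile("c:/split/input.xml", true)) {
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            ap.selectXPath("/a/b");
            int k = 0;
            // Raw bytes of the parsed document; fragments are copied straight from here.
            byte[] ba = vn.getXML().getBytes();
            while (ap.evalXPath() != -1) {
                // Offset is in the lower 32 bits, length in the upper 32 bits.
                long l = vn.getElementFragment();
                // try-with-resources closes each output file (the original leaked them).
                try (FileOutputStream fos = new FileOutputStream("c:/split/out" + k + ".xml")) {
                    fos.write("<a>".getBytes());
                    fos.write(ba, (int) l, (int) (l >> 32));
                    fos.write("</a>".getBytes());
                }
                k++;
            }
        }
    }
}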
For further reading, please visit http://www.devx.com/xml/Article/36379
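The VTD-XML code above reads the offset and length of each fragment out of a single long, with the offset in the lower 32 bits and the length in the upper 32 bits. The following stdlib-only sketch illustrates just that unpacking step; the class FragmentBits and its pack() helper are hypothetical and not part of VTD-XML.

```java
import java.nio.charset.StandardCharsets;

class FragmentBits {
    // Hypothetical helper: packs the two values the way the answer's code unpacks them.
    static long pack(int offset, int length) {
        return ((long) length << 32) | (offset & 0xFFFFFFFFL);
    }

    // Lower 32 bits: byte offset of the fragment.
    static int offset(long fragment) {
        return (int) fragment;
    }

    // Upper 32 bits: byte length of the fragment.
    static int length(long fragment) {
        return (int) (fragment >>> 32);
    }

    public static void main(String[] args) {
        byte[] doc = "<a><b>text1</b><b>text2</b></a>".getBytes(StandardCharsets.UTF_8);
        long fragment = pack(3, 12); // the first <b> element starts at byte 3 and is 12 bytes long
        // prints "<b>text1</b>"
        System.out.println(new String(doc, offset(fragment), length(fragment), StandardCharsets.UTF_8));
    }
}
```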
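Go old school:
// pstmtout, endStatementTag and service come from the surrounding application code.
StringBuilder buffer = new StringBuilder(1024 * 50);
try (BufferedReader reader = new BufferedReader(new FileReader(pstmtout))) {
    String line;
    while ((line = reader.readLine()) != null) {
        buffer.append(line);
        // This only works if the closing tag sits alone on its own line.
        if (line.equalsIgnoreCase(endStatementTag)) {
            service.handle(buffer.toString());
            buffer.setLength(0); // reset the buffer for the next chunk
        }
    }
}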
You can do this with the XProc language:
<?xml version="1.0" encoding="ISO-8859-1"?>
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <p:load href="in/huge-document.xml"/>
  <p:for-each>
    <p:iteration-source select="/a/b"/>
    <p:wrap match="/b" wrapper="a"/>
    <p:store>
      <p:with-option name="href" select="concat('part', p:iteration-position(), '.xml')">
        <p:empty/>
      </p:with-option>
    </p:store>
  </p:for-each>
</p:declare-step>
You can also use QuiXProc (a streaming XProc implementation: http://code.google.com/p/quixproc/) to try to process it as a stream.
I happen to like the XOM XML library, as its interface is simple, intuitive and powerful. To do what you want with XML, you can use your own NodeFactory and (for example) override the finishMakingElement() method. If it is making the element that you want (in your case, <b>), then you pass it along to whatever you need to do with it.