Split 1GB Xml file using Java
I have a 1GB Xml file, how can I split it into well-formed, smaller size Xml files using Java ?
Here is an example:
<records>
<record id="001">
开发者_Go百科<name>john</name>
</record>
....
</records>
Thanks.
I would use a StAX parser for this situation. It will prevent the entire document from being read into memory at one time.
- Advance the XMLStreamReader to the local root element of the sub-fragment.
- You can then use the javax.xml.transform APIs to produce a new document from this XML fragment. This will advance the XMLStreamReader to the end of that fragment.
- Repeat step 1 for the next fragment.
Code Example
For the following XML, output each "statement" section into a file named after the "account attributes value":
<statements>
<statement account="123">
...stuff...
</statement>
<statement account="456">
...stuff...
</statement>
</statements>
This can be done with the following code:
import java.io.File;
import java.io.FileReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
public class Demo {
public static void main(String[] args) throws Exception {
XMLInputFactory xif = XMLInputFactory.newInstance();
XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("input.xml"));
xsr.nextTag(); // Advance to statements element
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
while(xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
File file = new File("out/" + xsr.getAttributeValue(null, "account") + ".xml");
t.transform(new StAXSource(xsr), new StreamResult(file));
}
}
}
Try this, using Saxon-EE 9.3.
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:mode streamable="yes"/>
<xsl:template match="record">
<xsl:result-document href="record-{@id}.xml">
<xsl:copy-of select="."/>
</xsl:result-document>
</xsl:template>
</xsl:stylesheet>
The software isn't free, but if it saves you a day's coding you can easily justify the investment. (Apologies for the sales pitch).
DOM , STax, SAX all will do but have there own pros and cons.
- You can't put all the data in-memory in case of DOM.
- Programming control is easier in case of DOM then Stax and then SAX.
- A combination of SAX and DOM is a better option.
- Using a Framework which already does this can be the best option. Have a look at smooks.http://www.smooks.org
Hope this helps
I respectfully disagree with Blaise Doughan. SAX is not only hard to use, but very slow. With VTD-XML, you can not only use XPath to simplify processing logic (10x code reduction very common) but also much faster because there is no redundant encoding/decoding conversion. Below is the java code with vtd-xml
import java.io.FileOutputStream;
import com.ximpleware.*;
public class split {
public static void main(String[] args) throws Exception {
VTDGen vg = new VTDGen();
if (vg.parseHttpUrl("c:\\xml\\input.xml", true)) {
VTDNav vn = vg.getNav();
AutoPilot ap = new AutoPilot(vn);
ap.selectXPath("/records/record");
int i=-1,j=0;
while ((i = ap.evalXPath()) != -1) {
long l=vn.getElementFragment();
(new FileOutputStream("out"+j+".xml")).write(vn.getXML().getBytes(), (int)l,(int)(l>>32));
j++;
}
}
}
}
精彩评论