Split 1GB Xml file using Java

I have a 1GB XML file. How can I split it into smaller, well-formed XML files using Java?

Here is an example:

<records>
  <record id="001">
    <name>john</name>
  </record>
 ....
</records>

Thanks.


I would use a StAX parser for this situation. It will prevent the entire document from being read into memory at one time.

  1. Advance the XMLStreamReader to the local root element of the sub-fragment.
  2. You can then use the javax.xml.transform APIs to produce a new document from this XML fragment. This will advance the XMLStreamReader to the end of that fragment.
  3. Repeat step 1 for the next fragment.

Code Example

For the following XML, output each "statement" element into a file named after the value of its "account" attribute:

<statements>
   <statement account="123">
      ...stuff...
   </statement>
   <statement account="456">
      ...stuff...
   </statement>
</statements>

This can be done with the following code:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

public class Demo {

    public static void main(String[] args) throws Exception {
        new File("out").mkdirs(); // The output directory must exist before transform() writes into it

        XMLInputFactory xif = XMLInputFactory.newInstance();
        // Pass bytes rather than a FileReader so the parser honours the encoding in the XML declaration
        try (InputStream in = new FileInputStream("input.xml")) {
            XMLStreamReader xsr = xif.createXMLStreamReader(in);
            xsr.nextTag(); // Advance to the statements element

            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer t = tf.newTransformer();
            // Loop while the next tag is the start of another statement element
            while (xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
                File file = new File("out/" + xsr.getAttributeValue(null, "account") + ".xml");
                // Copies the current statement and advances the reader to the end of that fragment
                t.transform(new StAXSource(xsr), new StreamResult(file));
            }
            xsr.close();
        }
    }

}
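The example above writes one file per element. The question asks for smaller files that are each well-formed, so here is a rough sketch of a variation that groups several records into each output file and wraps them in a copy of the root element. It uses the StAX event API (XMLEventReader/XMLEventWriter) rather than the transformer. The class name ChunkSplitter, the chunk-N.xml naming and the perFile parameter are my own assumptions, and comments or whitespace sitting directly between records are dropped.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: splits <root><child/>...</root> into files of at most perFile children.
class ChunkSplitter {

    /** Returns the number of chunk files written to outDir. */
    static int split(Path input, Path outDir, int perFile) throws IOException, XMLStreamException {
        XMLEventFactory events = XMLEventFactory.newInstance();
        XMLOutputFactory xof = XMLOutputFactory.newInstance();
        Files.createDirectories(outDir);

        int files = 0, inChunk = 0, depth = 0;
        StartElement root = null;      // remembered so every chunk gets the same root tag
        OutputStream out = null;
        XMLEventWriter writer = null;  // non-null while a chunk file is open

        try (InputStream in = Files.newInputStream(input)) {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();
                if (e.isStartElement()) {
                    depth++;
                    if (depth == 1) {          // the document root, e.g. <records>
                        root = e.asStartElement();
                        continue;
                    }
                    if (depth == 2 && writer == null) {  // first record of a new chunk
                        out = Files.newOutputStream(outDir.resolve("chunk-" + files + ".xml"));
                        writer = xof.createXMLEventWriter(out, "UTF-8");
                        writer.add(events.createStartDocument("UTF-8", "1.0"));
                        writer.add(root);
                        files++;
                    }
                }
                if (depth >= 2 && writer != null) {
                    writer.add(e);             // copy everything that belongs to a record
                }
                if (e.isEndElement()) {
                    depth--;
                    if (depth == 1 && ++inChunk == perFile) {  // a record just ended and the chunk is full
                        finish(writer, out, events, root);
                        writer = null;
                        out = null;
                        inChunk = 0;
                    }
                }
            }
            reader.close();
            if (writer != null) {              // last, partially filled chunk
                finish(writer, out, events, root);
                out = null;
            }
        } finally {
            if (out != null) out.close();      // only reached with an open stream after a failure
        }
        return files;
    }

    // Closes the root element and the document, then releases the file.
    private static void finish(XMLEventWriter writer, OutputStream out,
                               XMLEventFactory events, StartElement root)
            throws IOException, XMLStreamException {
        writer.add(events.createEndElement(root.getName(), root.getNamespaces()));
        writer.add(events.createEndDocument());
        writer.close();
        out.close();
    }

    public static void main(String[] args) throws Exception {
        // Usage: java ChunkSplitter input.xml out 100000
        int n = split(Paths.get(args[0]), Paths.get(args[1]), Integer.parseInt(args[2]));
        System.out.println("Wrote " + n + " files");
    }
}
```

Only one record's worth of events is in flight at a time, so memory use should stay roughly flat regardless of the input size. Pick perFile to hit whatever output size you are after.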


Try this, using Saxon-EE 9.3.

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:mode streamable="yes"/>
    <xsl:template match="record">
      <xsl:result-document href="record-{@id}.xml">
        <xsl:copy-of select="."/>
      </xsl:result-document>
    </xsl:template>
</xsl:stylesheet>

The software isn't free, but if it saves you a day's coding you can easily justify the investment. (Apologies for the sales pitch).


DOM, StAX and SAX will all do, but each has its own pros and cons.

  1. With DOM you can't hold all of the data in memory.
  2. Programming control is easiest with DOM, then StAX, then SAX.
  3. A combination of SAX and DOM is a better option.
  4. Using a framework that already does this may be the best option. Have a look at Smooks: http://www.smooks.org
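
To illustrate point 3, here is a minimal sketch of one way to combine the two: a SAX handler streams through the file and builds a small DOM document for each record, so only one record is in memory at a time. The class name RecordDomHandler and the Consumer callback are my own assumptions, and the sketch ignores namespaces.

```java
import java.io.InputStream;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: SAX drives the parse, DOM holds one record at a time.
class RecordDomHandler extends DefaultHandler {
    private final DocumentBuilder builder;
    private final String recordName;
    private final Consumer<Document> sink;
    private Document doc;   // DOM of the record being read, null between records
    private Node current;   // node that receives the next child

    RecordDomHandler(String recordName, Consumer<Document> sink) throws Exception {
        this.builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        this.recordName = recordName;
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (doc == null) {
            if (!qName.equals(recordName)) {
                return;                 // outside any record, e.g. the root element
            }
            doc = builder.newDocument(); // start a fresh DOM for this record
            current = doc;
        }
        Element el = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            el.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        current.appendChild(el);
        current = el;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (doc != null) {
            current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (doc == null) {
            return;
        }
        current = current.getParentNode();
        if (current == doc) {           // the record element itself just closed
            sink.accept(doc);           // hand the finished DOM to the caller
            doc = null;
            current = null;
        }
    }

    static void parse(InputStream in, String recordName, Consumer<Document> sink) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(in, new RecordDomHandler(recordName, sink));
    }
}
```

The callback can then do whatever it likes with each record, for example serialise it to its own file with a javax.xml.transform Transformer and a DOMSource.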

Hope this helps


I respectfully disagree with Blaise Doughan. SAX is not only hard to use but also very slow. With VTD-XML you can use XPath to simplify the processing logic (a 10x reduction in code is very common), and it is also much faster because there is no redundant encoding/decoding conversion. Below is the Java code using VTD-XML:

import java.io.FileOutputStream;
import com.ximpleware.*;

public class Split {
    public static void main(String[] args) throws Exception {
        VTDGen vg = new VTDGen();
        // parseFile reads a local file; parseHttpUrl expects a URL
        if (vg.parseFile("c:\\xml\\input.xml", true)) {
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            ap.selectXPath("/records/record");
            byte[] xml = vn.getXML().getBytes(); // raw bytes of the parsed document
            int j = 0;
            while (ap.evalXPath() != -1) {
                // Lower 32 bits: byte offset of the fragment; upper 32 bits: its length
                long l = vn.getElementFragment();
                try (FileOutputStream out = new FileOutputStream("out" + j + ".xml")) {
                    out.write(xml, (int) l, (int) (l >> 32));
                }
                j++;
            }
        }
    }
}