Transforming one XML document into another with StAX takes a lot of time

I'm using the following code to transform a large XML stream into another stream:

 import java.io.ByteArrayInputStream;
 import java.io.InputStreamReader;
 import java.io.OutputStreamWriter;
 import java.io.PrintWriter;
 import java.io.Writer;
 import javax.xml.stream.XMLEventReader;
 import javax.xml.stream.XMLEventWriter;
 import javax.xml.stream.XMLInputFactory;
 import javax.xml.stream.XMLOutputFactory;
 import javax.xml.stream.XMLStreamException;
 import javax.xml.stream.XMLStreamReader;
 import javax.xml.stream.events.XMLEvent;
 import javax.xml.transform.Result;
 import javax.xml.transform.Source;
 import javax.xml.transform.Transformer;
 import javax.xml.transform.TransformerFactory;
 import javax.xml.transform.stax.StAXResult;
 import javax.xml.transform.stax.StAXSource;

 public class TryMe 
 {
   public static void main (final String[] args)
   {
    XMLInputFactory inputFactory = null;
    XMLEventReader eventReaderXSL = null;
    XMLEventReader eventReaderXML = null;
    XMLOutputFactory outputFactory = null;
    XMLEventWriter eventWriter = null;
    Source XSL = null;
    Source XML = null;
    inputFactory = XMLInputFactory.newInstance();
    outputFactory = XMLOutputFactory.newInstance();
    inputFactory.setProperty("javax.xml.stream.isSupportingExternalEntities", Boolean.TRUE);
    inputFactory.setProperty("javax.xml.stream.isNamespaceAware", Boolean.TRUE);
    inputFactory.setProperty("javax.xml.stream.isReplacingEntityReferences", Boolean.TRUE);
    try
    {
        eventReaderXSL = inputFactory.createXMLEventReader("my_template",
                new InputStreamReader(TryMe.class.getResourceAsStream("my_template.xsl")));
        eventReaderXML = inputFactory.createXMLEventReader("big_one", new InputStreamReader(
                TryMe.class.getResourceAsStream("big_one.xml")));
    }
    catch (final javax.xml.stream.XMLStreamException e)
    {
        System.out.println(e.getMessage());
    }

    // get a TransformerFactory object
    final TransformerFactory transfFactory = TransformerFactory.newInstance();

    // define the Source object for the stylesheet
    try
    {
        XSL = new StAXSource(eventReaderXSL);
    }
    catch (final javax.xml.stream.XMLStreamException e)
    {
        System.out.println(e.getMessage());
    }
    Transformer tran2 = null;
    // get a Transformer object
    try
    {

        tran2 = transfFactory.newTransformer(XSL);
    }
    catch (final javax.xml.transform.TransformerConfigurationException e)
    {
        System.out.println(e.getMessage());
    }

    // define the Source object for the XML document
    try
    {
        XML = new StAXSource(eventReaderXML);
    }
    catch (final javax.xml.stream.XMLStreamException e)
    {
        System.out.println(e.getMessage());
    }

    // create an XMLEventWriter object
    try
    {

        eventWriter = outputFactory.createXMLEventWriter(new OutputStreamWriter(System.out));
    }
    catch (final javax.xml.stream.XMLStreamException e)
    {
        System.out.println(e.getMessage());
    }

    // define the Result object
    final Result XML_r = new StAXResult(eventWriter);

    // call the transform method
    try
    {

        tran2.transform(XML, XML_r);
    }
    catch (final javax.xml.transform.TransformerException e)
    {
        System.out.println(e.getMessage());
    }

    // clean up
    try
    {
        eventReaderXSL.close();
        eventReaderXML.close();
        eventWriter.close();
    }
    catch (final javax.xml.stream.XMLStreamException e)
    {
        System.out.println(e.getMessage());
    }
}

}

my_template is something like this:

<xsl:stylesheet version = '1.0' 
     xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>

<xsl:preserve-space elements="*"/>

<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>


<xsl:template match="@k8[parent::point]">
  <xsl:attribute name="k8">
    <xsl:value-of select="'xxxxxxxxxxxxxx'"/>
  </xsl:attribute>
</xsl:template>

</xsl:stylesheet>

and the XML is a very long list of

<data>
  <point .... k8="blablabla" ... ></point>
  <point .... k8="blablabla" ... ></point>
  <point .... k8="blablabla" ... ></point>
  ....
  <point .... k8="blablabla" ... ></point>
</data>

If I use an identity transformer (using transfFactory.newTransformer() instead of transfFactory.newTransformer(XSL)), the output is produced while the input stream is being processed. With my template, however, that never happens: the transformer reads all the input and only then starts to produce the output (and with a large stream, an out-of-memory error very often comes before any result).

Any ideas? I'm freaking out. I can't understand what's wrong in my code or my XSLT.

Many thanks in advance!!


Well, XSLT 1.0 and 2.0 operate on a tree data model of the complete XML document, so XSLT 1.0 and 2.0 processors usually read the complete input document into a tree and create a result tree that is then serialized. You seem to assume that using StAX changes the behaviour of XSLT, but I don't think that is the case: the XSLT processor builds the tree because the stylesheet could require complex XPath navigation along axes such as preceding or preceding-sibling.

However, as you use Java, you could look into Saxon 9.3 and its experimental XSLT 3.0 streaming support; that way you should not run out of memory when processing very large XML input documents.

The unusual part of your XSLT is <xsl:template match="@k8[parent::point]">, which is usually written simply as <xsl:template match="point/@k8">, but you would need to test with your XSLT processor whether that changes performance.


Using XSLT is probably not the best approach. As others have pointed out, your solution requires that the processor read the entire document into memory before writing any output. You might wish to consider using a SAX parser to read each node sequentially, perform any transformation required (using a data-driven mapping if necessary) and write out the transformed data. This avoids building an entire document tree in memory and could enable significantly faster processing, as you're not attempting to construct a complex document before writing it out.

Ask yourself whether the output format is simple and stable, and then reconsider the use of XSLT. For large datasets of regular data, you might also wish to consider whether XML is a good file format for transferring the information.
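A minimal sketch of that SAX approach, assuming the only change needed is the one in the question's stylesheet (overwrite the k8 attribute of every point element). The class name SaxK8Filter and the rewrite helper are hypothetical names chosen for illustration. The filter sits between the parser and an identity Transformer, which the question reports as producing output while the input is still being read; whether a given processor really streams in this setup should be verified.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical SAX filter: passes every event through unchanged, except that
// the k8 attribute of <point> elements (no namespace) gets a fixed value.
class SaxK8Filter extends XMLFilterImpl {

    SaxK8Filter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (uri.isEmpty() && "point".equals(localName)) {
            int index = atts.getIndex("", "k8");
            if (index >= 0) {
                // Copy the attributes, because the parser may reuse its own object.
                AttributesImpl copy = new AttributesImpl(atts);
                copy.setValue(index, "xxxxxxxxxxxxxx");
                atts = copy;
            }
        }
        super.startElement(uri, localName, qName, atts);
    }

    // Parses 'in' through the filter and serializes the filtered events to 'out'.
    static void rewrite(Reader in, Writer out) throws Exception {
        SAXParserFactory parserFactory = SAXParserFactory.newInstance();
        parserFactory.setNamespaceAware(true);
        XMLReader parser = parserFactory.newSAXParser().getXMLReader();

        // Identity transformer, used here only as a serializer for SAX events.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.transform(
                new SAXSource(new SaxK8Filter(parser), new InputSource(in)),
                new StreamResult(out));
    }

    public static void main(String[] args) throws Exception {
        String sample = "<data><point id=\"keep\" k8=\"blablabla\"></point></data>";
        StringWriter out = new StringWriter();
        rewrite(new StringReader(sample), out);
        System.out.println(out);
    }
}
```

For a real file, the StringReader and StringWriter would be replaced by readers and writers over the actual input and output streams.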


The transformer reads all the input and then starts to produce the output (with a large stream of course very often an out of memory comes before a result.

Any Idea?

If you are finding that it takes too long for this work to complete, then you need to redesign your approach to your task to avoid reading in the entire input file before you start to process the output file. There is nothing that can be tweaked with your code to make it magically faster - you need to address the core of your algorithm.


How complex is the transformation you are doing with XSL? Could you make the same transformation using StAX alone?

With StAX it is quite easy to write a parser that matches a particular node and then inserts, alters or removes nodes in the output stream you are writing to at that point. So instead of using XSL for the transform, you could perhaps use StAX alone. This way you benefit from the streaming nature of the API (no large tree is buffered in memory), so there should be no memory issue.

Coincidentally, this recent answer to another question might help you with that.
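A rough sketch of what the StAX-only version might look like for the stylesheet in the question, assuming the only required change is overwriting k8 on point elements. StaxK8Rewriter and rewrite are hypothetical names, not part of any library. Events are copied from reader to writer one at a time, and only the start tag of point is rebuilt.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: copies XML events from input to output and replaces
// the value of k8 on every <point> element (no namespace).
class StaxK8Rewriter {
    private static final QName POINT = new QName("point");
    private static final QName K8 = new QName("k8");

    static void rewrite(Reader in, Writer out) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newInstance();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement() && POINT.equals(event.asStartElement().getName())) {
                    StartElement start = event.asStartElement();
                    // Rebuild the attribute list, swapping in the new k8 value.
                    List<Attribute> attributes = new ArrayList<>();
                    Iterator<?> it = start.getAttributes();
                    while (it.hasNext()) {
                        Attribute attribute = (Attribute) it.next();
                        attributes.add(K8.equals(attribute.getName())
                                ? events.createAttribute(K8, "xxxxxxxxxxxxxx")
                                : attribute);
                    }
                    event = events.createStartElement(
                            start.getName(), attributes.iterator(), start.getNamespaces());
                }
                // Each event is written as soon as it has been read.
                writer.add(event);
            }
        } finally {
            writer.close();
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String sample = "<data><point id=\"keep\" k8=\"blablabla\"></point></data>";
        StringWriter out = new StringWriter();
        rewrite(new StringReader(sample), out);
        System.out.println(out);
    }
}
```

Because no tree is built, memory use should stay roughly constant regardless of how many point elements the input contains, although that depends on the StAX implementation in use.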


As others have pointed out, using StAX won't change the way XSLT works: it reads everything first before starting any work. If you need to work with very large files, you'll have to use something other than XSLT.

There are different options:

  • Rewrite the transformation using SAX pipelines in Java.
  • Rewrite the transformation using STX/Joost, which is a streaming version of XSLT.
  • Rewrite the transformation using Scala's Freetle XML transformation framework.


Try Apache XSLTC for better performance; it uses code generation to simplify transforms.

Your XSLT transform looks really simple, and so does your input format. Surely you can do the processing manually with StAX or SAX and gain a really good performance increase.
