Is XslCompiledTransform to blame for slow XML transformation for a large file?
I am very new to XSLT, and the first thing I need to do is parse a 300 MB file (and that's on the small end). The XSLT is not complex at the moment; it just removes nodes that match certain criteria. I have two problems:
- It's too slow. It takes 50 seconds to process 500,000 records, and that's not fast enough.
- It consumes 500 MB of memory, so this will only get worse as the files get bigger.
Is there anything I can do natively in .NET to make it perform better?
I know I could look into SAX-based parsing, or STX (which is mentioned in another post), but I would prefer to stay within .NET.
Thank you!
EDIT: Here's my XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:test="http://schemas....">
  <xsl:output omit-xml-declaration="yes"/>
  <!-- Identity transform: copy every node and attribute through unchanged. -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>
  <!-- Empty template: drop any QueryRow whose hit_count column has a Value above 200. -->
  <xsl:template match="test:QueryRow[test:Columns/test:QueryColumn[test:Name='hit_count' and test:Value>200]]"/>
</xsl:stylesheet>
Here's the code I use to do the transform:
XslCompiledTransform compiledTransform = new XslCompiledTransform();
XsltSettings settings = new XsltSettings();
settings.EnableScript = true;
compiledTransform.Load("format.xslt", settings, null);

// Dispose the reader and writer so the file handles are released.
using (XmlReader xmlReader = XmlReader.Create("in.xml"))
using (XmlWriter xmlWriter = XmlWriter.Create("out.xml"))
{
    compiledTransform.Transform(xmlReader, xmlWriter); // this is what takes a long time
}
At the moment I am trying to just read the file in and write it back out, but it seems to read the whole file into memory, so I am trying to find a way to process it as a stream instead.
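For reference, a pure streaming copy would look something like this (a minimal sketch reusing the in.xml/out.xml names from the code above; XmlWriter.WriteNode pulls nodes from the reader one at a time, so the document is never held in memory all at once):

using System.Xml;

// Minimal streaming-copy sketch: copies input to output node by node
// without ever building an in-memory tree.
using (XmlReader reader = XmlReader.Create("in.xml"))
using (XmlWriter writer = XmlWriter.Create("out.xml"))
{
    writer.WriteNode(reader, true);
}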
Try profiling your XSLT. oXygen has a nice profiling capability that can tell you where the hot spots are in your transforms.
You could have some inefficient XPath expressions (e.g. //*), or have logic buried inside your templates (e.g. lots of for-each, if, choose, etc.) that is preventing the XSLT engine from optimizing. Moving some of that logic up into the template match criteria can help the engine optimize and reduce the size of the node sets that you iterate over and evaluate.
The XPath expression you're filtering on doesn't have anything obviously wrong with it, as such. But it's easy to envision it being a problem. If your QueryRow elements all have 20 Column children, each of which has 20 QueryColumn children, the XSLT processor is going to have to examine 400 elements before deciding that a given QueryRow element doesn't match. That's conceivably pretty inefficient, because if it turns out that the element shouldn't be filtered, the XSLT processor then has to visit all 400 elements again to output them all.
The .NET way to implement SAX-like XML parsing is to subclass XmlReader, which you could conceivably do in this case: you basically build an XmlReader that buffers QueryRow elements as it reads their descendants until it determines that they're OK, and then returns them to the caller of the Read method. That's going to be considerably faster than using XSLT to filter the XML, since using an XmlReader doesn't require you to build an in-memory representation of the unfiltered XML document before you can filter it.
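To make that concrete, here is a rough sketch of the buffering idea. It doesn't literally subclass XmlReader; instead it streams the document with a plain XmlReader, buffers each QueryRow subtree via XNode.ReadFrom, applies the same hit_count > 200 test as the stylesheet, and shallow-copies everything else through. The element names and file names come from the question (the test namespace URI is truncated there, so the constant below is a placeholder), and the whole thing should be treated as an untested sketch rather than a drop-in implementation:

using System.Xml;
using System.Xml.Linq;

class QueryRowFilter
{
    // The namespace URI is truncated in the question; substitute the real one.
    const string Ns = "http://schemas....";

    static void Main()
    {
        using (XmlReader reader = XmlReader.Create("in.xml"))
        using (XmlWriter writer = XmlWriter.Create("out.xml"))
        {
            reader.MoveToContent(); // position on the root element
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.LocalName == "QueryRow" &&
                    reader.NamespaceURI == Ns)
                {
                    // Buffer only this QueryRow subtree; ReadFrom advances
                    // the reader past its end tag, so no extra Read() here.
                    XElement row = (XElement)XNode.ReadFrom(reader);
                    if (!ShouldDrop(row))
                        row.WriteTo(writer);
                }
                else
                {
                    WriteShallowNode(reader, writer);
                    reader.Read();
                }
            }
        }
    }

    // Same test as the stylesheet's predicate: a QueryColumn named
    // hit_count whose Value is greater than 200.
    static bool ShouldDrop(XElement row)
    {
        XNamespace ns = Ns;
        foreach (XElement col in row.Descendants(ns + "QueryColumn"))
        {
            if ((string)col.Element(ns + "Name") == "hit_count" &&
                double.TryParse((string)col.Element(ns + "Value"), out double v) &&
                v > 200)
                return true;
        }
        return false;
    }

    // Copies the current node (without its children) from reader to writer.
    static void WriteShallowNode(XmlReader reader, XmlWriter writer)
    {
        switch (reader.NodeType)
        {
            case XmlNodeType.Element:
                writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                writer.WriteAttributes(reader, true);
                if (reader.IsEmptyElement) writer.WriteEndElement();
                break;
            case XmlNodeType.EndElement:
                writer.WriteFullEndElement();
                break;
            case XmlNodeType.Text:
                writer.WriteString(reader.Value);
                break;
            case XmlNodeType.Whitespace:
            case XmlNodeType.SignificantWhitespace:
                writer.WriteWhitespace(reader.Value);
                break;
            case XmlNodeType.CDATA:
                writer.WriteCData(reader.Value);
                break;
            case XmlNodeType.Comment:
                writer.WriteComment(reader.Value);
                break;
            case XmlNodeType.ProcessingInstruction:
                writer.WriteProcessingInstruction(reader.Name, reader.Value);
                break;
        }
    }
}

The key property is that only one QueryRow is buffered at a time, so memory use stays flat no matter how large the file gets, and there is no XSLT engine overhead at all.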
You could try checking out Saxon, which I hear is a very good and efficient XSLT processor. But XSLT in general cannot be processed in a streaming manner, even though your particular transform sounds like it could be, so unless the processor is very good at optimizing (as I understand it, Saxon is one of the best, if not the best), your memory-consumption problem may not be solvable.