
XML: Process large data

What XML parser do you recommend for the following purpose:

The XML file (formatted, containing whitespace) is around 800 MB. It mostly contains three types of tags (let's call them n, w and r). They have an attribute called id which I'd have to search for, as fast as possible.

Removing attributes I don't need could save around 30%, maybe a bit more.

First part, preparing the data for the second part: is there any good tool (command line, for Linux and Windows if possible) to easily remove unused attributes from certain tags? I know that XSLT could be used. Or are there any easy alternatives? Also, I could split it into three files, one for each tag, to gain speed for the later parsing... Speed is not too important for this preparation of the data; of course it would be nice if it took minutes rather than hours.

Second part: Once I have the data prepared, be it shortened or not, I need to be able to search for the id attribute I mentioned, and this is time-critical.

Estimates using wc -l tell me that there are around 3M n-tags and around 418K w-tags. The latter can contain up to approximately 20 subtags each. The w-tags also carry some attributes, but those would be stripped away.

"All I have to do" is navigating between tags containing certain id-attributes. Some tags have references to other id's, therefore giving me a tree, maybe even a graph. The original data is big (as mentioned), but the resultset shouldn't be too big as I only have to pick out certain elements.

Now the question: which XML parsing library should I use for this kind of processing? I would use Java 6 in the first instance, keeping in mind that it will later be ported to BlackBerry.

Might it be useful to just create a flat file indexing the ids and pointing to offsets in the file? Are the optimizations mentioned above even necessary? Or are there parsers known to be just as fast on the original data?

A little note: to test, I took the id on the very last line of the file and searched for it using grep. This took around a minute on a Core 2 Duo.

What happens if the file grows even bigger, let's say 5 GB?

I appreciate any advice or recommendation. Thank you all very much in advance, and regards.


As Bouman has pointed out, treating this as pure text processing will give you the best possible speed.

To process this as XML, the only practical way is to use a SAX parser. The SAX parser built into the Java API is perfectly capable of handling this, so there is no need to install any third-party libraries.
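A minimal sketch of that approach, using only the JDK's javax.xml.parsers and org.xml.sax classes. The tag names n/w/r and the attribute name id come from the question; the command-line arguments and class name are my own placeholders:

```java
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxIdSearch {
    public static void main(String[] args) throws Exception {
        final String wantedId = args[1];   // the id value to look for
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) throws SAXException {
                // Only the three element types mentioned in the question matter here.
                if ("n".equals(qName) || "w".equals(qName) || "r".equals(qName)) {
                    if (wantedId.equals(atts.getValue("id"))) {
                        System.out.println("found <" + qName + "> with id=" + wantedId);
                        // SAX has no "stop" call; throwing is the usual way to abort early.
                        throw new SAXException("match found");
                    }
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            parser.parse(new File(args[0]), handler);
        } catch (SAXException stopped) {
            // Reaching here after a match simply means we bailed out early.
        }
    }
}
```

Because SAX streams the document and keeps nothing in memory beyond the current event, the 800 MB size (or even 5 GB) only affects running time, not heap usage.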


I'm using XMLStarlet (http://xmlstar.sourceforge.net/) for working with huge XML files. There are versions for both Linux and Windows.


Large XML files and Java heap space are a vexed issue. StAX works on big files; it certainly handles 1 GB without batting an eyelid. There's a useful article on the subject of using StAX here: XML.com, which got me up and running with it in about 20 minutes.
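As a rough illustration, a StAX cursor loop over the file might look like the sketch below. The file name and id value come from the command line; the assumption that the id attribute is unprefixed is mine:

```java
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxIdSearch {
    public static void main(String[] args) throws Exception {
        String wantedId = args[1];
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(args[0]));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                // getAttributeValue(null, "id") matches an unprefixed id attribute.
                String id = reader.getAttributeValue(null, "id");
                if (wantedId.equals(id)) {
                    System.out.println("found <" + reader.getLocalName() + "> with id=" + id);
                    break;
                }
            }
        }
        reader.close();
    }
}
```

Unlike SAX, the pull model lets you simply break out of the loop once the element is found, which suits the "find one id as fast as possible" requirement.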


What XML parser do you recommend for the following purpose: The XML file (formatted, containing whitespace) is around 800 MB.

Perhaps you should take a look at VTD-XML: http://en.wikipedia.org/wiki/VTD-XML (see http://sourceforge.net/projects/vtd-xml/ for the download).
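If you try it, a lookup could look roughly like this. This is only a sketch assuming the com.ximpleware API (VTDGen, VTDNav, AutoPilot); the XPath expression, file name and class name are placeholders of mine, not anything prescribed by the library:

```java
import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

public class VtdIdSearch {
    public static void main(String[] args) throws Exception {
        VTDGen gen = new VTDGen();
        // parseFile builds VTD-XML's token index over the document in memory.
        if (!gen.parseFile(args[0], false)) {          // false = namespace-unaware
            System.err.println("parse failed");
            return;
        }
        VTDNav nav = gen.getNav();
        AutoPilot ap = new AutoPilot(nav);
        // Any element (n, w or r) carrying the requested id attribute.
        ap.selectXPath("//*[@id='" + args[1] + "']");
        while (ap.evalXPath() != -1) {
            System.out.println("found element: " + nav.toString(nav.getCurrentIndex()));
        }
    }
}
```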

It mostly contains three types of tags (let's call them n, w and r). They have an attribute called id which I'd have to search for, as fast as possible.

I know it's blasphemy, but have you considered awk or grep for the preprocessing? I mean, I know you can't really parse a nested structure like XML and detect errors in it that way, but perhaps your XML happens to be regular enough that it is possible anyway?

I know that XSLT could be used. Or are there any easy alternatives?

As far as I know, XSLT processors operate on a DOM tree of the source document, so they'd need to parse and load the entire document into memory... probably not a good idea for a document this large (or perhaps you have enough memory for that?). There is something called streaming XSLT, but I think the technique is quite young, there aren't many implementations around, and none of them are free AFAIK, so you could try that.


"I could split it into three files"

Try XmlSplit. It is a command-line program with options for specifying where to split, by element, attribute, etc. Google it and you should find it. Very fast, too.


XSLT tends to be comparatively fast even for large files. For large files, the trick is not creating a DOM first: pass the transformer a URL source or a stream source instead.

To strip the empty nodes and unwanted attributes, start with the identity transform template and filter them out. Then use XPath to pick out the tags you need.
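A minimal sketch of driving such a transform from Java with StreamSource/StreamResult. The stylesheet name strip.xslt and the file names are placeholders; the stylesheet itself would be the identity transform with templates that drop the unwanted attributes:

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class StripAttributes {
    public static void main(String[] args) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        // strip.xslt (hypothetical): identity transform minus the unwanted attributes.
        Transformer transformer =
                factory.newTransformer(new StreamSource("strip.xslt"));
        // StreamSource/StreamResult avoid building a DOM in application code;
        // note that the XSLT processor may still build its own tree internally.
        transformer.transform(new StreamSource("input.xml"),
                              new StreamResult("stripped.xml"));
    }
}
```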

You could also try a bunch of variations:

  • Split the large XML file into smaller ones while still preserving its composition using XInclude. This is very much like splitting large source files into smaller ones and using an include "x.h" kind of concept. This way, you may not have to deal with large files at all.

  • When you run your XML through the identity transform, use it to assign a unique id (UNID) to each node of interest using the generate-id() function.

  • Build a front-end index or database table for searching. Use the UNID generated above to quickly pinpoint the location of the data in the file (see the sketch after this list).
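One way to build such a lookup structure, and also an answer to the question's "flat file indexing the ids and pointing to an offset" idea: a single StAX pass that records the character offset of every element carrying an id attribute and dumps it to a flat file. The file naming and the tab-separated format are assumptions of mine:

```java
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.PrintWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class BuildIdIndex {
    public static void main(String[] args) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(args[0]));
        // Simple "id<TAB>offset" flat file, one line per element carrying an id.
        PrintWriter out = new PrintWriter(new FileWriter(args[0] + ".idx"));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                String id = reader.getAttributeValue(null, "id");
                if (id != null) {
                    // getCharacterOffset() may return -1 if the parser cannot supply it.
                    out.println(id + "\t" + reader.getLocation().getCharacterOffset());
                }
            }
        }
        reader.close();
        out.close();
    }
}
```

Once the index exists, a lookup only needs to scan (or binary-search, if sorted) the small index file and then seek into the big document, so the cost of the 800 MB file is paid once at index-build time rather than on every query.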
