What is the most efficient way to count XML nodes in Java?

2023-01-08 05:16

I have huge XML files, up to 1-2 GB, and obviously I can't parse a whole file at once. I'd have to split it into parts, then parse the parts and process them.

How can I count the number of occurrences of a certain node, so I can keep track of how many parts I need to split the file into? Is there maybe a better way to do this? I'm open to all suggestions, thank you.

Question update:

Well, I did use StAX, but maybe the logic I'm using it for is wrong. I'm parsing the file, and for each node I get the node value and store it in a StringBuilder. Then in another method I go through the StringBuilder and edit the output. Then I write that output to the file. I can process no more than 10000 objects like this.

Here is the exception I get:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at com.sun.org.apache.xerces.internal.util.NamespaceSupport.<init>(Unknown Source)
        at com.sun.xml.internal.stream.events.XMLEventAllocatorImpl.setNamespaceContext(Unknown Source)
        at com.sun.xml.internal.stream.events.XMLEventAllocatorImpl.getXMLEvent(Unknown Source)
        at com.sun.xml.internal.stream.events.XMLEventAllocatorImpl.allocate(Unknown Source)
        at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(Unknown Source)
        at com.sun.org.apache.xalan.internal.xsltc.trax.StAXEvent2SAX.bridge(Unknown Source)
        at com.sun.org.apache.xalan.internal.xsltc.trax.StAXEvent2SAX.parse(Unknown Source)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(Unknown Source)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(Unknown Source)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(Unknown Source)

Actually, I think my whole approach is wrong. What I'm actually trying to do is convert XML files into CSV samples. Here is how I do it so far:

Read/parse the XML file
For each element node, get the text node value
Open a stream and write the values to a temp file; after n nodes, flush and close the stream
Then open another stream, read from the temp file, use commons strip utils and some other processing to create proper CSV output, then write it to the CSV file

The SAX or StAX API would be your best bet here. They don't parse the whole thing at once; they take one node at a time and let your app process it. They're good for arbitrarily large documents.

SAX is the older API and works on a push model. StAX is newer and is a pull parser, which makes it rather easier to use, but for your requirements either one would be fine.

See this tutorial to get you started with StAX parsing.

You can use a streaming parser like StAX for this. It does not require you to read the entire file into memory at once.
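As a rough illustration of the StAX approach, here is a minimal sketch that counts start tags by local name. The class name StaxElementCounter and the element name "item" are made up for the example; for a real file you would pass a buffered FileInputStream instead of the in-memory stream used in main.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, named only for this sketch.
public class StaxElementCounter {

    // Pulls events one at a time and counts start tags whose local name matches.
    // Only the current event is held in memory, so file size is not a concern.
    public static long count(InputStream in, String elementName) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        long count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Small in-memory document standing in for the real 1-2 GB file.
        String xml = "<items><item>a</item><item>b</item><other/></items>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(count(in, "item")); // prints 2
    }
}
```

Note that this uses the cursor API (XMLStreamReader) rather than XMLEventReader, so no event object is allocated per node.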

I think you want to avoid creating a DOM, so SAX or StAX should be good choices.

With SAX, just implement a simple content handler that increments a counter whenever an interesting element is found.

With SAX you don't have to split the file: It's streaming, so it holds only the current bits in memory. It's very easy to write a ContentHandler that just does the counting. And it's very fast (in my experience, almost as fast as simply reading the file).
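A counting handler along those lines might look like the sketch below. The names SaxElementCounter and CountingHandler are invented for illustration, and it assumes the default (non-namespace-aware) parser, which is why it compares against qName.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named only for this sketch.
public class SaxElementCounter {

    // The parser pushes a startElement callback for every opening tag;
    // the handler keeps nothing but a counter.
    static class CountingHandler extends DefaultHandler {
        private final String elementName;
        private long count;

        CountingHandler(String elementName) {
            this.elementName = elementName;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            // With the default factory settings the tag name arrives in qName.
            if (elementName.equals(qName)) {
                count++;
            }
        }

        long getCount() {
            return count;
        }
    }

    public static long count(InputStream in, String elementName) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        CountingHandler handler = new CountingHandler(elementName);
        parser.parse(in, handler);
        return handler.getCount();
    }

    public static void main(String[] args) throws Exception {
        // Small in-memory document standing in for the real file.
        String xml = "<items><item>a</item><item>b</item><other/></items>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(count(in, "item")); // prints 2
    }
}
```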

Well, I did use StAX, but maybe the logic I'm using it for is wrong. I'm parsing the file, and for each node I get the node value and store it in a StringBuilder. Then in another method I go through the StringBuilder and edit the output. Then I write that output to the file. I can process no more than 10000 objects like this.

By this description, I'd say yes, the logic you're using it for is wrong. You're holding on to too much in memory.

Rather than parsing the entire file, storing all the node values into something and then processing the result, you should handle each node as you hit it, and output while parsing.
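To make "output while parsing" concrete, here is a sketch of a streaming XML-to-CSV conversion. It rests on assumptions not stated in the question: each CSV row corresponds to one record element (the recordName parameter), records are flat (every child element holds only text), and every field is simply quoted. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, named only for this sketch.
public class XmlToCsvStreamer {

    // Wraps a field in double quotes and doubles any embedded quotes.
    public static String escape(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Writes one CSV row per record element as soon as its end tag is seen,
    // so only the current record is ever held in memory.
    public static void convert(Reader xml, Writer csv, String recordName)
            throws XMLStreamException, IOException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        List<String> row = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        boolean inRecord = false;
        try {
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        if (recordName.equals(reader.getLocalName())) {
                            inRecord = true;
                            row.clear();
                        }
                        text.setLength(0); // discard whitespace between tags
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        if (inRecord) {
                            text.append(reader.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (recordName.equals(reader.getLocalName())) {
                            csv.write(String.join(",", row));
                            csv.write("\n");
                            inRecord = false;
                        } else if (inRecord) {
                            row.add(escape(text.toString().trim()));
                        }
                        text.setLength(0);
                        break;
                    default:
                        break;
                }
            }
        } finally {
            reader.close();
        }
        csv.flush();
    }

    public static void main(String[] args) throws Exception {
        // In-memory input and output standing in for the real files.
        String xml = "<rows><row><a>1</a><b>x</b></row><row><a>2</a><b>z</b></row></rows>";
        StringWriter out = new StringWriter();
        convert(new StringReader(xml), out, "row");
        System.out.print(out);
    }
}
```

For real files, a BufferedReader over the XML and a BufferedWriter over the CSV would replace the string-based streams; no temp file or second pass is needed.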

With more details on what you're actually trying to accomplish, and on what the input XML and the output look like, we could probably help streamline this.

You'd be better off using an event-based parser such as SAX.

I think splitting the file is not the way to go. You'd better handle the xml file as a stream and use the SAX API (and not the DOM API).

Even better, you should use XQuery to handle your requests.

Saxon is a good Java / .NET implementation (using SAX) that is amazingly fast, even on big files. Version HE is under an MPL open-source license.

Here is a little example:

java -cp saxon9he.jar net.sf.saxon.Query -qs:"count(doc('/path/to/your/doc/doc.xml')//YouTagToCount)"

With extended VTD-XML, you can load the document in a memory-efficient way, as it supports memory mapping. Compared to DOM, memory usage won't explode by an order of magnitude, and you will be able to use XPath to count the number of nodes very easily.
