
How can I efficiently parse 200,000 XML files in Java?

I have 200,000 XML files I want to parse and store in a database.

Here is an example of one: https://gist.github.com/902292

This is about as complex as the XML files get. This will also run on a small VPS (Linode) so memory is tight.

What I am wondering is:

1) Should I use a DOM or SAX parser? DOM seems easier and faster since each XML file is small.

2) Where is a simple tutorial on said parser? (DOM or SAX)

Thanks

EDIT

I tried the DOM route even though everyone suggested SAX. Mainly because I found an "easier" tutorial for DOM, and I thought that since the average file size was about 3k - 4k it would easily fit in memory.

However, I wrote a recursive routine to handle all 200k files and it gets about 40% of the way through them and then Java runs out of memory.

Here is part of the project. https://gist.github.com/905550#file_xm_lparser.java

Should I ditch DOM now and just use SAX? Just seems like with such small files DOM should be able to handle it.

Also, the speed is "fast enough". It's taking about 19 seconds to parse 2000 XML files (before the Mongo insert).

Thanks


Why not use a proper XML database (like Berkeley DB XML)? Then you can just dump the documents in directly, and create indices as needed (e.g. on the HotelID).


Divide and conquer

Split the 200,000 files into multiple buckets and parallelize the parse/insert. Look at Java 5 Executors if you want to keep it simple, or use spring-batch if this is a recurring task, in which case you can benefit from a high-level framework.
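A minimal sketch of that bucketing idea with Java 5-style Executors; the directory path and the parseAndInsert method are placeholders for the real parse/insert logic:

```java
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelParser {

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical directory; assumes it exists and contains the XML files.
        final File[] files = new File("/path/to/xml").listFiles();
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        // Split the files into one bucket per thread and submit each bucket as a task.
        int bucketSize = (files.length + threads - 1) / threads;
        for (int start = 0; start < files.length; start += bucketSize) {
            final int from = start;
            final int to = Math.min(start + bucketSize, files.length);
            pool.submit(new Runnable() {
                public void run() {
                    for (int i = from; i < to; i++) {
                        parseAndInsert(files[i]); // parse one file and insert it into the DB
                    }
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void parseAndInsert(File f) {
        // parse the XML and insert the result into the database (not shown)
    }
}
```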

API

Using SAX can help but is not necessary since you are not going to keep the parsed model around (i.e. all you are doing is parsing and inserting, then letting go of the parsed data, at which point the objects are eligible for GC). Look into a simple API like JDOM.
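A minimal JDOM 2 sketch of that parse-then-discard approach; the file name and the Hotel/HotelID/Name element names are just placeholders for whatever the real documents contain:

```java
import java.io.File;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.input.SAXBuilder;

public class JdomExample {

    public static void main(String[] args) throws Exception {
        SAXBuilder builder = new SAXBuilder();                 // reuse one builder for all files
        Document doc = builder.build(new File("hotel.xml"));   // hypothetical file name
        Element root = doc.getRootElement();

        // Element names below are illustrative; use the ones from your XML.
        for (Element hotel : root.getChildren("Hotel")) {
            String id = hotel.getChildText("HotelID");
            String name = hotel.getChildText("Name");
            // build a POJO from id/name and hand it off for insertion,
            // then let the Document go out of scope so it can be GC'd
        }
    }
}
```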

Other ideas

You can implement a producer/consumer model where the producer emits the POJOs created by parsing and the consumer takes the POJOs and inserts them into the database. The advantage here is that you can batch the inserts to gain more performance.
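A rough sketch of such a producer/consumer hand-off using a BlockingQueue; the HotelRecord class and the batch size are made up for illustration, and the bulk insert itself is left out:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

public class BatchingConsumer implements Runnable {

    /** Hypothetical POJO produced by the parser thread(s). */
    public static class HotelRecord {
        String hotelId;
        String name;
    }

    private static final int BATCH_SIZE = 500;   // tune to taste
    private final BlockingQueue<HotelRecord> queue;

    public BatchingConsumer(BlockingQueue<HotelRecord> queue) {
        this.queue = queue;
    }

    public void run() {
        List<HotelRecord> batch = new ArrayList<HotelRecord>(BATCH_SIZE);
        try {
            while (true) {
                batch.add(queue.take());                          // wait for the next record
                queue.drainTo(batch, BATCH_SIZE - batch.size());  // grab anything else already queued
                if (batch.size() >= BATCH_SIZE) {
                    insertBatch(batch);                           // one bulk insert instead of many singles
                    batch.clear();
                }
            }
        } catch (InterruptedException e) {
            if (!batch.isEmpty()) {
                insertBatch(batch);                               // flush the remainder on shutdown
            }
            Thread.currentThread().interrupt();
        }
    }

    private void insertBatch(List<HotelRecord> batch) {
        // bulk insert into the database (e.g. a Mongo batch insert) -- not shown
    }
}
```

The producer side is just the parser calling queue.put(record) after each file; interrupting the consumer thread when the producers finish flushes the last partial batch.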


SAX always beats DOM on speed, but since you say the XML files are small you can proceed with a DOM parser. One thing you can do to speed things up is create a thread pool and do the database operations in it. Multithreaded updates will significantly improve performance.
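One way that could look, assuming the parsing stays on the main thread and only the database writes go to the pool; the directory path and the toRecord/insert methods are placeholders. Mapping each DOM to a small record right away also lets the tree be garbage-collected, which matters for the out-of-memory problem described above:

```java
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DomWithDbPool {

    public static void main(String[] args) throws Exception {
        // Parse on the main thread, hand the database writes to a small pool.
        ExecutorService dbPool = Executors.newFixedThreadPool(4);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();

        for (File f : new File("/path/to/xml").listFiles()) {   // hypothetical directory
            Document doc = builder.parse(f);
            final Object record = toRecord(doc);   // map the DOM to a small POJO, then let the DOM go
            dbPool.submit(new Runnable() {
                public void run() {
                    insert(record);                // database write happens on the pool
                }
            });
        }
        dbPool.shutdown();
    }

    private static Object toRecord(Document doc) { return null; /* extract the fields you need */ }
    private static void insert(Object record)    { /* write to the database -- not shown */ }
}
```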

  • Lalith


Go with SAX, or if you want, StAX. Forget about DOM. Use an efficient library like Aalto.
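A small StAX cursor-API sketch; if Aalto is on the classpath it is normally picked up by the standard XMLInputFactory.newInstance() provider lookup, otherwise you get the JDK's default StAX parser. The file and element names are illustrative:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxExample {

    public static void main(String[] args) throws Exception {
        // Picks up Aalto via the service-provider mechanism when it is on the classpath.
        XMLInputFactory factory = XMLInputFactory.newInstance();

        InputStream in = new FileInputStream("hotel.xml");      // hypothetical file name
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "HotelID".equals(reader.getLocalName())) { // element name is illustrative
                    String hotelId = reader.getElementText();
                    // hand hotelId (and any other fields) off for the database insert
                }
            }
        } finally {
            reader.close();
            in.close();
        }
    }
}
```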

I am sure that parsing will be quite cheap compared to making the database requests.

But 200k is not such a big number if you only need to do this once.


SAX will be faster than DOM, and that could well matter when you have 200,000 files to parse.


StAX is faster than SAX, and both are much faster than DOM. If performance is super critical you can also think about building a special compiler to parse the XML files. But usually with StAX the lexing and parsing is not much of an issue; the "after-processing" is.
