Scaling an application that reads large XML files
I have an application that periodically (about once every 10 minutes) reads a large set of XML files, around 20-30 at a time. Each XML file is roughly 40-100 MB in size. Once each XML file has been read, a map is created from it, and the map is then passed through a chain of 10-15 processors, each of which uses the data to perform some filtering, write to a database, etc.
The application currently runs in a 32-bit JVM, and there is no intention of moving to a 64-bit JVM right now. The memory footprint is, as expected, very high... nearing the threshold of a 32-bit JVM. For now, when we receive large files, we serialize the generated map to disk and run at most 3-4 maps through the processor chain concurrently, because if we try to process all the maps at the same time, it easily runs into an OutOfMemoryError. Garbage collection activity is also pretty high.
I have some ideas but wanted to see if there are some options which people have already tried/evaluated. So what are the options here for scaling this kind of application?
Yeah, to parrot @aaray and @MeBigFatGuy, you want to use some event-based parser for this: the dom4j approach mentioned, or SAX or StAX.
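To make the event-based suggestion concrete, here is a minimal StAX sketch using the JDK's built-in javax.xml.stream API. The document layout (records/record/value, with an id attribute) and the class name StaxMapReader are made up for illustration, since the question does not show the real schema. The point is only that the parser hands you one event at a time, so you keep the fields you need and never hold the whole tree in memory.

```java
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxMapReader {
    // Assumed (hypothetical) layout:
    // <records><record id="a"><value>42</value><note>...</note></record>...</records>
    static Map<String, Long> readValues(Reader in) {
        Map<String, Long> result = new HashMap<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                String currentId = null;
                while (r.hasNext()) {
                    // Pull the next event; nothing is buffered beyond the current one.
                    if (r.next() != XMLStreamConstants.START_ELEMENT) {
                        continue;
                    }
                    String name = r.getLocalName();
                    if ("record".equals(name)) {
                        currentId = r.getAttributeValue(null, "id");
                    } else if ("value".equals(name) && currentId != null) {
                        // Store the number as a long instead of keeping the raw text.
                        result.put(currentId, Long.parseLong(r.getElementText().trim()));
                    }
                    // Any other element (e.g. <note>) is simply skipped.
                }
            } finally {
                r.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML stream", e);
        }
        return result;
    }
}
```

In a real application you would pass a reader over the file rather than a string, and probably hand each record to the processor chain as it is completed instead of collecting everything into one map.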
As a simple example, that 100MB XML file consumes a minimum of 200MB of RAM if you load it wholesale, since each character (one byte on disk for mostly-ASCII text) is immediately expanded to a 16-bit char.
Next, any tags or elements that you're not using are going to consume extra memory (plus all of the other baggage and bookkeeping of the nodes), and it's all wasted. If you're dealing with numbers, converting the raw string to a long will be a net win if the number is longer than 2 digits.
IF (and this is a BIG IF) you are using many repeated instances of a reasonably small set of Strings, you can save some memory by String.intern()'ing them. This is a canonicalization process that makes sure that if the string already exists in the JVM, it's shared. The downside of this is that it pollutes your PermGen (once interned, always interned). PermGen is pretty finite, but on the other hand it's pretty much immune to GC.
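A small sketch of what interning buys you, assuming the values really do come from a small repeated set (status codes, tag names, and so on). The class and method names here are made up for illustration. Note that where interned strings live depends on the JVM version: since JDK 7, HotSpot keeps them in the main heap rather than PermGen.

```java
import java.util.ArrayList;
import java.util.List;

class InternDemo {
    // Replaces every string with its canonical (interned) instance, so equal
    // values end up sharing a single String object instead of one copy each.
    static List<String> canonicalize(List<String> raw) {
        List<String> out = new ArrayList<>(raw.size());
        for (String s : raw) {
            out.add(s.intern());
        }
        return out;
    }
}
```

With a parser this would typically be applied to each value as it is read, so the duplicate copies become garbage immediately rather than sitting in the map.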
Have you considered being able to run the XML through an external XSLT to remove all of the cruft that you don't want to process before it even enters your JVM? There are several standalone, command line XSL processors that you can use to pre-process the files to something perhaps more sane. It really depends on how much of the data that is coming in you're actually using.
By using an event based XML processing model, the XSLT step is pretty much redundant. But the event based models are all basically awful to use, so perhaps using the XSLT step would let you re-use some of your existing DOM logic (assuming that's what you're doing).
The flatter your internal structures, the cheaper they are in terms of memory. You actually have a little bit of an advantage running a 32-bit VM, since instance pointers are half the size. But still, when you're talking thousands or millions of nodes, it all adds up, and quickly.
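As a rough illustration of "flat", here is a sketch that stores records in two parallel primitive arrays instead of a map of node objects. It assumes, purely for the example, that each record boils down to a numeric id and a numeric value; the FlatRecords class is hypothetical and not something from the question. Each record then costs two longs, with no per-entry object headers or String instances.

```java
import java.util.Arrays;

class FlatRecords {
    private long[] ids = new long[16];
    private long[] values = new long[16];
    private int size;

    // Parses the raw text once and keeps only the primitive values.
    void add(String rawId, String rawValue) {
        if (size == ids.length) {
            // Grow both arrays together so index i always refers to one record.
            ids = Arrays.copyOf(ids, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        ids[size] = Long.parseLong(rawId.trim());
        values[size] = Long.parseLong(rawValue.trim());
        size++;
    }

    int size() {
        return size;
    }

    long idAt(int i) {
        checkIndex(i);
        return ids[i];
    }

    long valueAt(int i) {
        checkIndex(i);
        return values[i];
    }

    private void checkIndex(int i) {
        if (i < 0 || i >= size) {
            throw new IndexOutOfBoundsException("index " + i + ", size " + size);
        }
    }
}
```

Whether this is worth it depends on how the processors access the data; lookups by key would need a sorted array plus binary search or a separate index.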
We had a similar problem processing large XML files (around 400 MB). We greatly reduced the memory footprint of the application using this:
http://dom4j.sourceforge.net/dom4j-1.6.1/faq.html#large-doc
You can insert the contents of each XML file into a temporary DB table and each chain link would fetch the data it needs. You will probably lose performance, but gain scalability.