
Keeping data in memory: design approach

I have a problem wherein I need to process files ranging in size from a few KB up to 1 GB. The input is a flat file format where each record, say a payment instruction, is stored on a single line. The application has to go through each payment instruction and form groups based on some grouping logic. At the end, the groups have to be converted to another format (ISO 20022 XML), which will be used for the actual payment processing.

The current design is such that we have two tables: the grouping criteria are stored in one table and the individual payment instructions in another (a one-to-many relationship from the group table to the payment instruction table). In step 1, as we go through the flat file we identify the group each instruction belongs to and write it to the database (with bulk commits, by the way).

In step 2 of the batch processing, the groups are read one by one, the output XML is formed, and it is sent to the destination.

The issue I'm facing now is that writing to two tables and then fetching from them feels like overkill if the entire thing could be done in memory.

I'm thinking of an approach where I keep a HashTable-like cache (Google Guava's MapMaker) whose size I can specify, and once the cache reaches its upper limit I write the entries to the database tables (by weaving an aspect onto the put-to-cache operation).

In the same way, while retrieving entries I can first check the cache for the key, and if it is not there, query the database.
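Roughly what I have in mind, as a minimal sketch using Guava's CacheBuilder (the successor to MapMaker's size-limited caching). The PaymentGroup and GroupDao types and the size limit are placeholders, not part of the actual design:

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.cache.RemovalListener;
import com.google.common.cache.RemovalNotification;

public class GroupCache {

    // Placeholder types standing in for the real group model and the two tables.
    public static class PaymentGroup { /* grouping criteria + its payment instructions */ }

    public interface GroupDao {
        void save(String groupKey, PaymentGroup group);
        PaymentGroup load(String groupKey);
    }

    private final LoadingCache<String, PaymentGroup> cache;

    public GroupCache(final GroupDao dao, long maxEntries) {
        this.cache = CacheBuilder.newBuilder()
                .maximumSize(maxEntries)   // the configurable upper limit
                .removalListener(new RemovalListener<String, PaymentGroup>() {
                    @Override
                    public void onRemoval(RemovalNotification<String, PaymentGroup> n) {
                        // Spill to the database only when the entry was pushed out by the size limit.
                        if (n.wasEvicted()) {
                            dao.save(n.getKey(), n.getValue());
                        }
                    }
                })
                .build(new CacheLoader<String, PaymentGroup>() {
                    @Override
                    public PaymentGroup load(String groupKey) {
                        // Cache miss: fall back to the database, as described above.
                        return dao.load(groupKey);
                    }
                });
    }

    public PaymentGroup get(String groupKey) {
        return cache.getUnchecked(groupKey);
    }

    public void put(String groupKey, PaymentGroup group) {
        cache.put(groupKey, group);
    }
}
```

(The sketch ignores the case where a group exists neither in the cache nor in the database yet; that would need handling in the loader.)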

What is your opinion on this design approach? Is it another blunder, or something I can make practical, stable, and able to scale?

The reason I thought of this is that we don't always have big files coming in, and we only need these temporary tables when we cannot process the entire file in memory without risking OutOfMemory problems.

Could you please give some suggestions?

Thanks


I can't see that your caching needs are so exotic that you can't use off-the-shelf components. You could try Hibernate for accessing your database; it supports caching.
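For example, enabling the second-level cache on an entity looks roughly like this (a sketch only; the entity and field names are made up, and you'd still need to turn on hibernate.cache.use_second_level_cache and plug in a cache provider such as Ehcache in your Hibernate configuration):

```java
import javax.persistence.Cacheable;
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

// Hypothetical entity for the grouping-criteria table.
@Entity
@Cacheable
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
public class PaymentGroupEntity {

    @Id
    private String groupKey;

    private String groupingCriteria;

    // getters and setters omitted for brevity
}
```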


I think that your design sounds reasonable. However, there are a few things to keep in mind. First, are you sure that adding the extra complexity is justified? That is, is the performance hit of writing everything to the database and then reading it back an important bottleneck? If the wasted time isn't important, I would strongly caution you against making this change; you'd just be increasing the complexity of the system without much of a benefit. I assume you've thought about this already, but just in case you haven't, I thought I'd mention it here.

Second, have you considered using memory-mapped files via MappedByteBuffer? If you're dealing with data that exceeds the Java heap space and are willing to put in a bit of effort, you might want to design the objects so that they are stored in memory-mapped files. You could do this by creating a wrapper class that is essentially a thin layer translating requests into operations on the mapped byte buffer. For example, if you want to store a list of requests, you could create an object that uses a MappedByteBuffer to store a list of strings on disk. The strings could be separated by newlines or null terminators, for example. You could then iterate across the strings by walking the bytes of the file and rehydrating them. The advantage of this approach is that it offloads the caching complexity to the operating system, which has been performance-tuned for decades (assuming you're using a major OS!) to handle this case efficiently. I once worked on a Java project where I built a framework to automate this, and it worked wonderfully in many cases. There's definitely a learning curve to get over, but once it works you can handle far more data than would fit in the Java heap. This does essentially what you proposed above, except it trades a bit of up-front implementation complexity for letting the OS handle all of the caching.
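A minimal sketch of the newline-separated-strings idea; the file name, region size and record contents are illustrative:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class MappedStringStore {

    public static void main(String[] args) throws Exception {
        long size = 64L * 1024 * 1024; // 64 MB region; pick something suited to your data

        try (RandomAccessFile raf = new RandomAccessFile("instructions.dat", "rw");
             FileChannel channel = raf.getChannel()) {

            // Map the file into memory; the OS pages it in and out as needed.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, size);

            // Write a few newline-separated records (ASCII content for simplicity).
            for (String record : new String[] {"PAY-1;100.00;EUR", "PAY-2;250.00;USD"}) {
                buffer.put(record.getBytes(StandardCharsets.UTF_8));
                buffer.put((byte) '\n');
            }

            // Rewind and "rehydrate" the records by walking the bytes up to the write position.
            int end = buffer.position();
            buffer.rewind();
            StringBuilder current = new StringBuilder();
            for (int i = 0; i < end; i++) {
                byte b = buffer.get();
                if (b == '\n') {
                    System.out.println(current);
                    current.setLength(0);
                } else {
                    current.append((char) b);
                }
            }
        }
    }
}
```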

Third, is there a way to combine steps (1) and (2)? That is, could you generate the XML at the same time that you're populating the database? I assume from your description that the issue is that you can't generate the final XML until all of the entries are ready. However, you could consider creating several files on disk, each storing objects of one type in serialized XML form, and at the end of the pass use a standard command-line utility like cat to join them all together. Since this is just bulk byte concatenation rather than parsing database contents, it could be much faster (and easier to implement) than your proposed approach. If the files are still hot in the OS cache (which they probably are, since you've just been writing to them), this might actually be faster than your current approach.
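A sketch of the concatenation step done in Java NIO rather than with cat; the fragment file names are assumptions:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class FragmentConcatenator {

    public static void main(String[] args) throws IOException {
        // Hypothetical per-group fragment files produced during the first pass.
        List<Path> fragments = Arrays.asList(
                Paths.get("group-A.xmlfragment"),
                Paths.get("group-B.xmlfragment"));

        Path output = Paths.get("payments-iso20022.xml");

        try (OutputStream out = Files.newOutputStream(output)) {
            // Any enclosing document header would be written here first.
            for (Path fragment : fragments) {
                Files.copy(fragment, out);   // raw byte copy, no XML parsing involved
            }
            // ...and any document footer here.
        }
    }
}
```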

Fourth, if performance is your concern, have you considered parallelizing your code? Given staggeringly huge files to process, you could split each file into lots of smaller regions. Each task would then read its region of the file and distribute the pieces into the proper output files. You could then have a final process merge the partial files together and produce the overall XML report. Since I assume this is a mostly I/O-bound operation (it's mostly just file reading), this could give you a much bigger performance win than a single-threaded approach that tries to keep everything in memory.
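A rough sketch of the fan-out; the chunking and the groupKeyOf logic are placeholders, and for brevity this version accumulates groups in memory rather than writing per-group files:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelGrouper {

    public static void main(String[] args) throws Exception {
        // For brevity this reads all lines up front; a streaming split would be kinder to memory.
        List<String> lines = Files.readAllLines(Paths.get("payments.txt"), StandardCharsets.UTF_8);

        int threads = Runtime.getRuntime().availableProcessors();
        int chunkSize = Math.max(1, lines.size() / threads);

        // Thread-safe accumulator: group key -> instructions belonging to that group.
        ConcurrentMap<String, List<String>> groups = new ConcurrentHashMap<>();

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> futures = new ArrayList<>();

        for (int start = 0; start < lines.size(); start += chunkSize) {
            final List<String> chunk = lines.subList(start, Math.min(start + chunkSize, lines.size()));
            futures.add(pool.submit(() -> {
                for (String instruction : chunk) {
                    String key = groupKeyOf(instruction);
                    groups.computeIfAbsent(key, k -> Collections.synchronizedList(new ArrayList<String>()))
                          .add(instruction);
                }
            }));
        }

        for (Future<?> f : futures) {
            f.get();   // wait for the workers and surface any exception
        }
        pool.shutdown();

        // The groups could now feed the XML-generation step (or be flushed to per-group files).
        System.out.println("Formed " + groups.size() + " groups");
    }

    // Placeholder grouping logic: here simply the first semicolon-separated field.
    private static String groupKeyOf(String instruction) {
        int idx = instruction.indexOf(';');
        return idx >= 0 ? instruction.substring(0, idx) : instruction;
    }
}
```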

Hope this helps!


Have you had a look at Spring Batch? It has support for processing flat files, splitting them by field values, and processing the results in parallel. With Spring JDBC you could still store the grouping criteria in a database, but process the file without having to use an intermediate table.
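Reading the flat file with a FlatFileItemReader might look roughly like this (a sketch only: the field names are made up, and setter signatures vary slightly between Spring Batch versions):

```java
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;
import org.springframework.core.io.FileSystemResource;

public class FlatFileReadExample {

    // Hypothetical record type matching the tokenizer's field names.
    public static class PaymentInstruction {
        private String reference;
        private String amount;
        private String currency;
        public String getReference() { return reference; }
        public void setReference(String reference) { this.reference = reference; }
        public String getAmount() { return amount; }
        public void setAmount(String amount) { this.amount = amount; }
        public String getCurrency() { return currency; }
        public void setCurrency(String currency) { this.currency = currency; }
    }

    public static void main(String[] args) throws Exception {
        DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer();   // comma-delimited by default
        tokenizer.setNames(new String[] {"reference", "amount", "currency"});

        BeanWrapperFieldSetMapper<PaymentInstruction> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
        fieldSetMapper.setTargetType(PaymentInstruction.class);

        DefaultLineMapper<PaymentInstruction> lineMapper = new DefaultLineMapper<>();
        lineMapper.setLineTokenizer(tokenizer);
        lineMapper.setFieldSetMapper(fieldSetMapper);

        FlatFileItemReader<PaymentInstruction> reader = new FlatFileItemReader<>();
        reader.setName("paymentInstructionReader");
        reader.setResource(new FileSystemResource("payments.txt"));
        reader.setLineMapper(lineMapper);

        reader.open(new ExecutionContext());
        PaymentInstruction item;
        while ((item = reader.read()) != null) {
            // the grouping logic would go here instead of printing
            System.out.println(item.getReference());
        }
        reader.close();
    }
}
```

In a real job you'd wire the reader into a step and let Spring Batch drive the chunked processing instead of calling read() in a loop.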


No, it's probably not worth the effort to do caching with a fallback to a (temporary?) table, mainly because it's going to be complex, increasing risk and cost.

However, there is potential for speeding up the initial sorting into groups, and there is nothing that says you need to use an RDBMS for that.

I suggest that you skip the homebrew caching and use a persistent collection, i.e. a collection that is backed by a file on your local disk. This approach will most likely speed up processing of both small and large files (compared to using a relational database).

However, you should performance test it... It's not certain that a half-decent Java B-tree can beat a properly configured database server. But if you're up against the typical mismanaged database running on a slice of a crappy system at the other end of a slow network, then there is absolutely a chance.

Google for persistent collections or NoSQL for Java; here are some that I know of:

http://jdbm.sourceforge.net/ might be used as a "persistent/scalable" map. Maybe also http://code.google.com/p/pcollections/ (but I have not tried that one myself).
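A rough sketch of what a JDBM-backed map could look like. This is written from memory of the JDBM2-style API (RecordManagerFactory / treeMap), so treat the exact class and method names as assumptions and check the project docs:

```java
import java.io.IOException;
import java.util.Map;

import jdbm.RecordManager;
import jdbm.RecordManagerFactory;

public class PersistentGroupStore {

    public static void main(String[] args) throws IOException {
        // Opens (or creates) database files named "groups.*" on the local disk.
        RecordManager recMan = RecordManagerFactory.createRecordManager("groups");

        // A B-tree backed map that spills to disk instead of living entirely on the heap.
        Map<String, String> groups = recMan.treeMap("paymentGroups");

        groups.put("GROUP-1", "<GrpHdr>...</GrpHdr>");   // illustrative content only
        recMan.commit();                                  // flush changes to disk

        System.out.println(groups.get("GROUP-1"));
        recMan.close();
    }
}
```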

You should be able to find more; try and test :-)
