Smart buffering in an environment with a limited amount of memory (Java)
Dear StackOverflowers,
I am writing an application that sorts a huge number of integers from a binary file. I need to do it as quickly as possible, and the main performance issue is disk access time: I make a multitude of reads, which slows the algorithm down quite significantly.
The standard way of doing this would be to fill ~50% of the available memory with a buffered object of some sort (BufferedInputStream etc.), then transfer the integers from the buffered object into an array of integers (which takes up the rest of the free space) and sort the integers in the array. Save the sorted block back to disk, repeat the procedure until the whole file is split into sorted blocks, and then merge the blocks together. This strategy for sorting the blocks uses only 50% of the available memory, since the data is essentially duplicated (50% for the cache and 50% for the array, while they store the same data).
I am hoping to optimise this phase of the algorithm (sorting the blocks) by writing my own buffered class that caches data straight into an int array, so that the array can take up all of the free space, not just 50% of it. This would reduce the number of disk accesses in this phase by a factor of 2. The thing is, I am not sure where to start.
EDIT: Essentially I would like to find a way to fill up an array of integers by executing only one read on the file. Another constraint is the array has to use most of the free memory.
If any of the statements I made are wrong, or at least seem to be, please correct me.
Any help is appreciated.
Regards
When you say limited, how limited: <1 MB, <10 MB, <64 MB?
It makes a difference, since you won't actually get much benefit, if any, from having large BufferedInputStreams.
In most cases the default buffer size of 8192 bytes (JDK 1.6) is enough, and increasing it doesn't usually make that much difference.
Using a smaller BufferedInputStream should leave you with nearly all of the heap to create and sort each chunk before writing it to disk.
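To make that concrete, here is a minimal sketch of the idea: keep the stream buffer at the 8192-byte default and spend the rest of the heap on the int array. It assumes the file holds 4-byte big-endian integers (the format DataInputStream.readInt expects); the class and method names (ChunkSorter, open, readSortedChunk) are made up for illustration and are not from the question.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper: small stream buffer, large int[] chunk.
final class ChunkSorter {
    /** Wraps the raw stream in a small (8 KiB) buffer; most of the heap stays free for the chunk. */
    static DataInputStream open(InputStream raw) {
        return new DataInputStream(new BufferedInputStream(raw, 8192));
    }

    /**
     * Fills chunk with up to chunk.length ints, sorts the filled prefix,
     * and returns how many ints were read (0 once the input is exhausted).
     * Assumption: the input is a sequence of 4-byte big-endian ints.
     */
    static int readSortedChunk(DataInputStream in, int[] chunk) throws IOException {
        int n = 0;
        try {
            while (n < chunk.length) {
                chunk[n] = in.readInt();
                n++;
            }
        } catch (EOFException endOfInput) {
            // No more complete ints in the stream; keep what was read so far.
        }
        Arrays.sort(chunk, 0, n);
        return n;
    }
}
```

You would call readSortedChunk in a loop with one big array, writing each sorted prefix out as a run, until it returns 0.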
You might want to look into the Java NIO libraries, specifically FileChannel and IntBuffer.
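A rough sketch of how that could look, assuming the file contains 4-byte big-endian ints and that the staging buffer's capacity is a multiple of 4. The names NioChunkReader and fillChunk are hypothetical. The staging ByteBuffer can be small (a few KiB), so nearly all of the memory goes to the int array:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical helper: copies ints from a FileChannel into an int[] via a small staging buffer.
final class NioChunkReader {
    /**
     * Fills chunk from the channel's current position and returns the number of ints read.
     * Assumption: staging.capacity() is a positive multiple of 4 and the data is big-endian.
     */
    static int fillChunk(FileChannel channel, ByteBuffer staging, int[] chunk) throws IOException {
        int filled = 0;
        while (filled < chunk.length) {
            staging.clear();
            // Never read more bytes than the chunk still has room for.
            long roomInBytes = (long) (chunk.length - filled) * Integer.BYTES;
            staging.limit((int) Math.min(staging.capacity(), roomInBytes));
            while (staging.hasRemaining() && channel.read(staging) != -1) {
                // Keep reading until the staging buffer is full or the file ends.
            }
            staging.flip();
            int ints = staging.remaining() / Integer.BYTES;
            if (ints == 0) {
                break; // End of file (any trailing partial int is ignored).
            }
            // View the staged bytes as ints and bulk-copy them into the array.
            staging.asIntBuffer().get(chunk, filled, ints);
            filled += ints;
        }
        return filled;
    }
}
```

This is not literally a single read on the file, but each channel.read moves a whole buffer at once, and no second full-size copy of the data is held in memory.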
You don't give many hints, but two things come to mind. First, if you have many integers but not that many distinct values, bucket sort could be the solution.
Secondly, one word (OK, term) screams in my head when I hear that: external tape sorting. In the early days of computing (i.e. the stone age), data lived on tapes, and it was very hard to sort data spread over multiple tapes. It is very similar to your situation. And indeed merge sort was the most often used sort in those days, and as far as I remember, Knuth's TAOCP had a nice chapter about it. There might be some good hints there about the size of caches, buffers and similar.
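For the few-distinct-values case, one possible sketch is a counting variant of bucket sort: keep one counter per distinct value instead of storing every integer. Whether this fits depends on how many distinct values there really are, which the question does not say. The class and method names below are invented for illustration.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.IntConsumer;

// Hypothetical helper: counting-style bucket sort for inputs with few distinct values.
final class DistinctValueSort {
    /** Adds the first length values of chunk to the running counts (one bucket per distinct value). */
    static void addAll(TreeMap<Integer, Long> counts, int[] chunk, int length) {
        for (int i = 0; i < length; i++) {
            counts.merge(chunk[i], 1L, Long::sum);
        }
    }

    /** Emits every value in ascending order, repeated as often as it was counted. */
    static void emitSorted(TreeMap<Integer, Long> counts, IntConsumer sink) {
        for (Map.Entry<Integer, Long> bucket : counts.entrySet()) {
            for (long i = 0; i < bucket.getValue(); i++) {
                sink.accept(bucket.getKey());
            }
        }
    }
}
```

Memory use then grows with the number of distinct values rather than with the file size, so the file can be streamed through once in small chunks and the sink can write straight to the output file.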
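The merge phase of such an external sort can be sketched as a k-way merge over the sorted runs using a min-heap. This is only an illustration, assuming each run is a stream of 4-byte big-endian ints already in ascending order; RunMerger and its members are hypothetical names, not something from Knuth or the question.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper: merges already-sorted runs into one sorted output.
final class RunMerger {
    /** The current smallest unread value of one run. */
    private static final class Head {
        final DataInputStream run;
        int value;

        Head(DataInputStream run) {
            this.run = run;
        }

        /** Loads the next int of this run; returns false once the run is exhausted. */
        boolean advance() throws IOException {
            try {
                value = run.readInt();
                return true;
            } catch (EOFException endOfRun) {
                return false;
            }
        }
    }

    /** Writes the ints of all runs to out in ascending order. */
    static void merge(List<DataInputStream> runs, DataOutputStream out) throws IOException {
        PriorityQueue<Head> heap = new PriorityQueue<>((a, b) -> Integer.compare(a.value, b.value));
        for (DataInputStream run : runs) {
            Head head = new Head(run);
            if (head.advance()) {
                heap.add(head);
            }
        }
        while (!heap.isEmpty()) {
            Head smallest = heap.poll();
            out.writeInt(smallest.value);
            if (smallest.advance()) {
                heap.add(smallest); // Re-insert with the run's next value.
            }
        }
        out.flush();
    }
}
```

Only one value per run is held in the heap at a time, so the memory needed during the merge is dominated by whatever stream buffers you wrap around the runs and the output.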